Node and cluster life-cycle in ignite-3

2021-06-01 Thread Sergey Chugunov
 Hello Igniters,

I would like to start a discussion on evolving IEP-73 [1]. Currently it covers
only the narrow topic of component dependencies, but it makes sense for the IEP
to address a broader question: how different components should be initialized
to support different modes of an individual node or of the whole cluster.

There is an idea to borrow the notion of run-levels from Unix-like systems,
and I suggest the following design to implement it.

   1. To start and function at a specific run-level, a node needs to start and
   initialize components in the proper order. During initialization, components
   may need to notify each other about reaching a particular run-level so that
   other components can execute their own actions. Orchestrating this
   process should be the responsibility of a new component.

   2. The orchestration component doesn't manage the initialization process
   directly but delegates to another abstraction called a scenario (see the
   sketch after this list). Examples of run-levels in the context of Ignite 2.x
   are Maintenance Mode and the INACTIVE-READONLY-ACTIVE cluster states; each
   level is reached when the corresponding scenario has been executed.

   So the responsibility of the orchestrator is to manage scenarios and
   provide them with the infrastructure for spreading notification events between
   components. All low-level details and knowledge of existing components and
   their dependencies are encapsulated inside scenarios.

   3. Scenarios allow nesting, e.g. a scenario for the INACTIVE cluster state
   can be "upgraded" to the READONLY state by executing the diff between the
   INACTIVE and READONLY scenarios.
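
To make this more concrete, below is a minimal sketch of the two abstractions
in Java (all names here are hypothetical illustrations, not a proposed final API):

/** Run-levels a node can reach (illustrative set; the actual levels are to be defined). */
enum RunLevel { STOPPED, MAINTENANCE, INACTIVE, READONLY, ACTIVE }

/** Encapsulates knowledge of components and the order of their initialization. */
interface Scenario {
    /** Run-level this scenario brings the node to when executed. */
    RunLevel targetLevel();

    /** Starts/initializes components, using the orchestrator to spread notifications. */
    void execute(Orchestrator orchestrator) throws Exception;
}

/** Manages scenarios and provides the notification infrastructure between components. */
interface Orchestrator {
    /** Moves the node to the given run-level by executing a scenario (or a diff of scenarios). */
    void advanceTo(RunLevel level) throws Exception;

    /** Lets a component react when a particular run-level has been reached. */
    void onLevelReached(RunLevel level, Runnable callback);
}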


I see several advantages of this design compared to the existing model in
Ignite 2.x (mostly implemented in IgniteKernal and based on two main
methods: start and onKernalStart):

   1. A more flexible model allows implementing more diverse run-levels for
   different needs (the already mentioned Maintenance Mode, cluster state modes
   like ACTIVE-INACTIVE, and smart strategies for cache warm-up on node start).

   2. Knowledge of components and their dependencies is encapsulated inside
   scenarios, which makes it easier to create new scenarios.


Open questions:

   1. As I see it right now, it is hard to standardize the initialization events
   that components notify each other with.

   2. It is not clear whether run-levels should be organized into one rigid
   hierarchy (where the first run-level always precedes the second and so
   on) or whether they should be more independent.


What do you think?

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-73%3A+Node+startup


Re: Fix force rebuild indexes

2021-03-25 Thread Sergey Chugunov
Kirill,

Indeed, the current behavior of the force rebuild API seems broken; we need to fix
it, +1 from me too.

BTW, would it be useful to allow rebuilding individual indexes?
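
Just to illustrate the idea (a purely hypothetical sketch, not the real
index-processing code): such an API could accept specific index names and fail
fast while a previous rebuild is still running:

import java.util.Collection;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a per-index force-rebuild entry point. */
class IndexForceRebuild {
    private final AtomicBoolean rebuildInProgress = new AtomicBoolean();

    /** Rebuilds only the given indexes of a cache; fails fast if a rebuild is already running. */
    public void forceRebuild(String cacheName, Collection<String> indexNames) {
        if (!rebuildInProgress.compareAndSet(false, true))
            throw new IllegalStateException("Index rebuild already in progress for cache: " + cacheName);

        try {
            for (String idx : indexNames) {
                // A real implementation would drop and repopulate the index tree here.
            }
        }
        finally {
            rebuildInProgress.set(false);
        }
    }
}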

On Wed, Mar 24, 2021 at 6:20 PM ткаленко кирилл 
wrote:

> Hello!
>
> What do you mean by the implementation plan?
> Implement ticket https://issues.apache.org/jira/browse/IGNITE-14321
>
> 24.03.2021, 17:17, "Maxim Muzafarov" :
> > Hello,
> >
> > I think the issue definitely must be fixed, so +1 from my side.
> > BTW, what would be your implementation plan?
> >
> > I think the [1] issue may be interesting for you.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-13056
> >
> > On Tue, 23 Mar 2021 at 21:04, ткаленко кирилл 
> wrote:
> >>  Hello everyone!
> >>
> >>  I found that a forced rebuild of indexes does not work correctly. If
> the indexes were rebuilt once, then nothing will happen each time a forced
> rebuild is attempted. Also, if during the first rebuild of indexes (before
> the checkpoint) we call a forced rebuild of indexes, then we will execute
> it sequentially after the first. It seems that we need to fix this.
> >>
> >>  I suggest not to allow (throw an exception) to start a forced rebuild
> of indexes until the previous one completes.
> >>  And, of course, fix the ability to launch a forced rebuild of indexes.
> >>
> >>  I want to do this on ticket
> https://issues.apache.org/jira/browse/IGNITE-14321.
> >>
> >>  Sorry, the thread was without a subject.
> >>  http://apache-ignite-developers.2346864.n4.nabble.com/-td51935.html
> >>
> >>  WDYT?
>


[jira] [Created] (IGNITE-14382) Network module API structuring

2021-03-24 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14382:


 Summary: Network module API structuring
 Key: IGNITE-14382
 URL: https://issues.apache.org/jira/browse/IGNITE-14382
 Project: Ignite
  Issue Type: Sub-task
  Components: networking
Reporter: Sergey Chugunov
 Fix For: 3.0.0-alpha2


The first version of the network module introduced a NetworkCluster interface 
providing access to all functionality of the module: sending and receiving 
messages in a p2p fashion, a topology API (the current set of online nodes, node 
join and leave events) and some lifecycle-related methods.

Further development has shown that it makes sense to split these pieces of 
functionality into separate interfaces accessible from NetworkCluster or a 
similar single entry point.

Suggested names for these interfaces: *Topology* and *Messaging*. Keeping 
lifecycle callbacks and methods on the same entry-point interface seems natural at 
the moment.
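
A rough sketch of how the split could look (interface and method names below are 
only an illustration, not the final API):

import java.util.Collection;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Single entry point that keeps lifecycle and exposes the separated APIs. */
interface NetworkCluster {
    Topology topology();

    Messaging messaging();

    void shutdown() throws Exception;
}

/** Topology API: current set of online members plus join/leave events. */
interface Topology {
    Collection<NetworkMember> allMembers();

    void addEventListener(Consumer<NetworkMember> onJoin, Consumer<NetworkMember> onLeave);
}

/** Messaging API: point-to-point send and receive. */
interface Messaging {
    CompletableFuture<Void> send(NetworkMember recipient, Object message);

    void addMessageHandler(Consumer<Object> handler);
}

/** Placeholder for a member descriptor. */
interface NetworkMember {
    String name();
}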





[jira] [Created] (IGNITE-14323) Messaging naming unification

2021-03-16 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14323:


 Summary: Messaging naming unification
 Key: IGNITE-14323
 URL: https://issues.apache.org/jira/browse/IGNITE-14323
 Project: Ignite
  Issue Type: Sub-task
  Components: networking
Reporter: Sergey Chugunov
 Fix For: 3.0.0-alpha2


Naming of the methods for message sending in the NetworkCluster interface should be 
unified:

# *send* method returning a CompletableFuture with the semantics "send a message and 
wait until the remote node replies with a result".
# *sendNoAck* method returning void with the semantics "send a message to a remote node 
and return immediately once the message is written to the output connection".





[jira] [Created] (IGNITE-14297) API to unregister HandlerProvider from network module

2021-03-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14297:


 Summary: API to unregister HandlerProvider from network module
 Key: IGNITE-14297
 URL: https://issues.apache.org/jira/browse/IGNITE-14297
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov
 Fix For: 3.0.0-alpha2


At the moment client components can register HandlerProviders in the network 
component but cannot unregister them.

However, unregistering is important for the component lifecycle: it is needed to 
properly stop a component.

An API to unregister a handler from the network module should be implemented, with a 
clear contract about possible races (one thread unregisters a component's handler 
while another thread sends a message from the same component).
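
One possible shape of such an API, purely as a sketch (the handle-based 
registration is a hypothetical suggestion, not an existing class):

/** Sketch: registration returns a handle that the owning component closes on stop. */
interface HandlerRegistry {
    /** Registers a provider and returns a handle used to unregister it later. */
    AutoCloseable register(HandlerProvider provider);
}

/** Stands in for the existing handler abstraction of the network module. */
interface HandlerProvider {
    void onMessage(Object message);
}

A handle makes the contract explicit: once close() returns, the network module 
could guarantee it no longer invokes the provider, and a message sent concurrently 
from the same component either completes before the close or fails fast.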





[jira] [Created] (IGNITE-14296) Classe

2021-03-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14296:


 Summary: Classe
 Key: IGNITE-14296
 URL: https://issues.apache.org/jira/browse/IGNITE-14296
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


Class names in the network module are self-explanatory and don't need a special 
prefix; it could be removed to make the code more compact.





[jira] [Created] (IGNITE-14295) Message interface to be introduced

2021-03-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14295:


 Summary: Message interface to be introduced
 Key: IGNITE-14295
 URL: https://issues.apache.org/jira/browse/IGNITE-14295
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


The network module should introduce a public Message interface to represent messages 
to send and receive.

This interface should provide at least information about the message type (and 
possibly a version) to enable effective serialization/deserialization and the ability 
to subscribe to messages of a certain type.
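
A minimal sketch of such an interface (method names are illustrative):

/** Public marker for network messages with enough metadata for (de)serialization. */
interface Message {
    /** Numeric message type, used for serialization and for subscribing by type. */
    short type();

    /** Message format version, to allow the format to evolve. */
    byte version();
}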





[jira] [Created] (IGNITE-14231) IGNITE_ENABLE_FORCIBLE_NODE_KILL flag is not supported in inverse connection request scenario

2021-02-24 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14231:


 Summary: IGNITE_ENABLE_FORCIBLE_NODE_KILL flag is not supported in 
inverse connection request scenario
 Key: IGNITE-14231
 URL: https://issues.apache.org/jira/browse/IGNITE-14231
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.9.1
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.11


The IGNITE_ENABLE_FORCIBLE_NODE_KILL flag enables server nodes to forcibly kill 
client nodes that are visible via Discovery but unreachable via the Communication 
protocol.

In the inverse connection request scenario this leads to an infinite loop: the server 
tries to establish a communication connection to the unreachable client, fails, and 
tries again, effectively ignoring the flag.





[jira] [Created] (IGNITE-14184) API for off-line update of configuration

2021-02-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14184:


 Summary: API for off-line update of configuration
 Key: IGNITE-14184
 URL: https://issues.apache.org/jira/browse/IGNITE-14184
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


Tools like the new CLI may include the ability to view or change an existing 
configuration without starting Ignite nodes.
This may also be useful in Ignite version upgrade scenarios.

The configuration module should support this case with all validations and the rest 
of its functionality.





[jira] [Created] (IGNITE-14183) Cross-root validation

2021-02-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14183:


 Summary: Cross-root validation
 Key: IGNITE-14183
 URL: https://issues.apache.org/jira/browse/IGNITE-14183
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


Current validation works only inside one configuration root, but it is possible 
that properties from one root depend on properties from another.

Cross-root validation should be implemented to take these cases into account.





[jira] [Created] (IGNITE-14182) NamedList remove improvements

2021-02-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14182:


 Summary: NamedList remove improvements
 Key: IGNITE-14182
 URL: https://issues.apache.org/jira/browse/IGNITE-14182
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


From the API perspective, to remove an element from a NamedList we need to nullify 
that particular element in the list.

On the Storage level this turns into removing all keys sitting under that 
particular element.

The configuration engine should be responsible for cleaning up all necessary keys 
from Storage. Notifications should be aware of removals from NamedLists as well 
(e.g. one notification about removing an element from the NamedList instead of a 
bunch of notifications about each of the element's fields).








[jira] [Created] (IGNITE-14181) Configuration to support arrays of primitive types

2021-02-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14181:


 Summary: Configuration to support arrays of primitive types
 Key: IGNITE-14181
 URL: https://issues.apache.org/jira/browse/IGNITE-14181
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


Configuration should support declaring arrays of primitive types (e.g. the array 
of addresses in IpFinder).

Only primitive types are needed; for user-defined types NamedLists should be used 
instead.





[jira] [Created] (IGNITE-14180) Storage notification API

2021-02-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14180:


 Summary: Storage notification API
 Key: IGNITE-14180
 URL: https://issues.apache.org/jira/browse/IGNITE-14180
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


The local (and in the future global) Storage should support a notification mechanism: 
all interested components should be able to subscribe to notifications about 
stored keys (add, remove, update).
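
As an illustration only (the real Storage interfaces may end up different), a 
subscription API could look like this:

import java.util.function.Consumer;

/** Kind of change observed for a stored key. */
enum StorageEventType { ADD, REMOVE, UPDATE }

/** Notification delivered to subscribers. */
interface StorageEvent {
    StorageEventType type();

    String key();

    byte[] value();
}

/** Storage-side part of the notification mechanism. */
interface WatchableStorage {
    /** Subscribes to changes of keys under the given prefix; the returned handle unsubscribes. */
    AutoCloseable watch(String keyPrefix, Consumer<StorageEvent> listener);
}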





[jira] [Created] (IGNITE-14178) Asynchronous Storage API

2021-02-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14178:


 Summary: Asynchronous Storage API
 Key: IGNITE-14178
 URL: https://issues.apache.org/jira/browse/IGNITE-14178
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov








[jira] [Created] (IGNITE-14155) Test IgniteClusterIdTagTest.testInMemoryClusterTag is flaky on TC

2021-02-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-14155:


 Summary: Test IgniteClusterIdTagTest.testInMemoryClusterTag is 
flaky on TC
 Key: IGNITE-14155
 URL: https://issues.apache.org/jira/browse/IGNITE-14155
 Project: Ignite
  Issue Type: Test
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov


History of the test is available 
[here|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=2444565365384645281=testDetails_IgniteTests24Java8=%3Cdefault%3E].

This test is flaky, but the problem is in the test itself: it synchronously 
asserts a condition that is intrinsically asynchronous.
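
The likely fix is to wait for the condition instead of asserting it right away, 
e.g. via the test framework's GridTestUtils.waitForCondition helper (a sketch only; 
'cluster' and 'expectedTag' stand for objects the real test already has, and the 
code lives inside a test method that declares "throws Exception"):

// Poll the asynchronously updated cluster tag instead of asserting it immediately.
boolean updated = GridTestUtils.waitForCondition(
    () -> expectedTag.equals(cluster.tag()),
    10_000); // give the update up to 10 seconds to propagate

assertTrue("Cluster tag was not updated in time", updated);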





Re: [DISCUSSION] Unified Configuration for Ignite 3.0

2020-12-17 Thread Sergey Chugunov
Val,

Together with Semyon Danilov I did the final polishing of the code and merged it to
the main branch in the ignite-3 repo.

The code assembles without any issues, tests are green, and IgniteRunner starts and
serves REST requests successfully.

On Tue, Dec 15, 2020 at 10:37 PM Valentin Kulichenko <
valentin.kuliche...@gmail.com> wrote:

> Thanks, Sergey! Looks good to me.
>
> -Val
>
> On Tue, Dec 15, 2020 at 12:12 AM Sergey Chugunov <
> sergey.chugu...@gmail.com>
> wrote:
>
> > Val,
> >
> > Your comments make total sense to me, I've fixed them and updated pull
> > request. Please take a look at my code when you have time.
> >
> > I also added a port range configuration to enable starting multiple
> > instances of ignite without specifying port manually for each instance.
> >
> > --
> > Best Regards,
> > Sergey Chugunov
> >
> > On Sat, Dec 12, 2020 at 3:20 AM Valentin Kulichenko <
> > valentin.kuliche...@gmail.com> wrote:
> >
> > > Hi Sergey,
> > >
> > > Thanks for doing this.
> > >
> > > It looks like PR #5 is already under review, so I guess it will be
> merged
> > > soon. I would really love to see that, because the configuration
> > framework
> > > is one of the foundational components - we need it to continue building
> > > Ignite 3.0.
> > >
> > > As for PR #6, it looks a little raw, but I believe we need it to
> connect
> > > the configuration framework with the CLI tool that is also pending for
> > the
> > > merge, is this correct? If that's the case, I think it's OK to merge
> this
> > > code as a separate module, with an understanding that it will change
> > > significantly down the road. I would do a couple of changes though:
> > >
> > >1. Get rid of "simplistic-ignite" naming, as it's a little
> confusing.
> > >Even though it's more of a prototype at this point, it should be
> clear
> > > what
> > >the module is responsible for. Can we rename it to "ignite-runner"
> or
> > >something along those lines?
> > >2. Update the output - I don't like that it prints out the
> > >Javalin's banner and messages. I suggest replacing this with some
> very
> > >basic Ignite logging: an entry showing the version of Ignite; an
> entry
> > >indicating that the REST protocol is enabled on a certain port; an
> > entry
> > >that the process is successfully started. This is just to make sure
> > that
> > >anyone who plays with it understands what's going on.
> > >
> > > Any objections?
> > >
> > > -Val
> > >
> > > On Fri, Dec 11, 2020 at 9:53 AM Sergey Chugunov <
> > sergey.chugu...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hello Igniters,
> > > >
> > > > I would like to present two pull requests [1], [2] with basic
> > > > implementation of IEP-55 for Unified Configuration [3] and IEP-63
> REST
> > > API
> > > > for Unified Configuration [4].
> > > >
> > > > The main goal of these PRs is to present and discuss a new approach
> for
> > > > preparing and managing Ignite configuration in a more robust and
> > > convenient
> > > > way than it was before.
> > > >
> > > > These PRs cover basic aspects of configuration but other steps for
> > > > developing functionality are already defined; ticket IGNITE-13511 [5]
> > > > summarizes work to do.
> > > >
> > > > In a nutshell proposed approach to configuration is as follows:
> > > >
> > > > We want to declare configuration with POJO-based schemas that are
> > concise
> > > > and contain all important information about validation and how
> > different
> > > > pieces of configuration relate to each other.
> > > > When schemas are marked with annotations annotation processor enters
> > the
> > > > game and generates most of boilerplate code thus freeing users from
> > > writing
> > > > it by hand.
> > > >
> > > > REST API module from [2] contains an example of managing
> configuration
> > > and
> > > > exposing it to external tools like a Unified CLI tool presented in
> [6].
> > > >
> > > > [1] https://github.com/apache/ignite-3/pull/5
> > > > [2] https://github.com/apache/ignite-3/pull/6
> > > > [3]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration
> > > > [4]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-63%3A+REST+API+module+to+integrate+new+modular+architecture+and+management
> > > > [5] https://issues.apache.org/jira/browse/IGNITE-13511
> > > > [6]
> > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Unified-CLI-tool-td50618.html
> > > >
> > >
> >
>


Re: [DISCUSSION] Unified Configuration for Ignite 3.0

2020-12-15 Thread Sergey Chugunov
Val,

Your comments make total sense to me; I've addressed them and updated the pull
request. Please take a look at my code when you have time.

I also added a port range configuration to enable starting multiple
instances of Ignite without specifying a port manually for each instance.

--
Best Regards,
Sergey Chugunov

On Sat, Dec 12, 2020 at 3:20 AM Valentin Kulichenko <
valentin.kuliche...@gmail.com> wrote:

> Hi Sergey,
>
> Thanks for doing this.
>
> It looks like PR #5 is already under review, so I guess it will be merged
> soon. I would really love to see that, because the configuration framework
> is one of the foundational components - we need it to continue building
> Ignite 3.0.
>
> As for PR #6, it looks a little raw, but I believe we need it to connect
> the configuration framework with the CLI tool that is also pending for the
> merge, is this correct? If that's the case, I think it's OK to merge this
> code as a separate module, with an understanding that it will change
> significantly down the road. I would do a couple of changes though:
>
>1. Get rid of "simplistic-ignite" naming, as it's a little confusing.
>Even though it's more of a prototype at this point, it should be clear
> what
>the module is responsible for. Can we rename it to "ignite-runner" or
>something along those lines?
>2. Update the output - I don't like that it prints out the
>Javalin's banner and messages. I suggest replacing this with some very
>basic Ignite logging: an entry showing the version of Ignite; an entry
>indicating that the REST protocol is enabled on a certain port; an entry
>that the process is successfully started. This is just to make sure that
>anyone who plays with it understands what's going on.
>
> Any objections?
>
> -Val
>
> On Fri, Dec 11, 2020 at 9:53 AM Sergey Chugunov  >
> wrote:
>
> > Hello Igniters,
> >
> > I would like to present two pull requests [1], [2] with basic
> > implementation of IEP-55 for Unified Configuration [3] and IEP-63 REST
> API
> > for Unified Configuration [4].
> >
> > The main goal of these PRs is to present and discuss a new approach for
> > preparing and managing Ignite configuration in a more robust and
> convenient
> > way than it was before.
> >
> > These PRs cover basic aspects of configuration but other steps for
> > developing functionality are already defined; ticket IGNITE-13511 [5]
> > summarizes work to do.
> >
> > In a nutshell proposed approach to configuration is as follows:
> >
> > We want to declare configuration with POJO-based schemas that are concise
> > and contain all important information about validation and how different
> > pieces of configuration relate to each other.
> > When schemas are marked with annotations annotation processor enters the
> > game and generates most of boilerplate code thus freeing users from
> writing
> > it by hand.
> >
> > REST API module from [2] contains an example of managing configuration
> and
> > exposing it to external tools like a Unified CLI tool presented in [6].
> >
> > [1] https://github.com/apache/ignite-3/pull/5
> > [2] https://github.com/apache/ignite-3/pull/6
> > [3]
> >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration
> > [4]
> >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-63%3A+REST+API+module+to+integrate+new+modular+architecture+and+management
> > [5] https://issues.apache.org/jira/browse/IGNITE-13511
> > [6]
> >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Unified-CLI-tool-td50618.html
> >
>


[DISCUSSION] Unified Configuration for Ignite 3.0

2020-12-11 Thread Sergey Chugunov
Hello Igniters,

I would like to present two pull requests [1], [2] with basic
implementation of IEP-55 for Unified Configuration [3] and IEP-63 REST API
for Unified Configuration [4].

The main goal of these PRs is to present and discuss a new approach to
preparing and managing Ignite configuration in a more robust and convenient
way than before.

These PRs cover the basic aspects of configuration, but further steps for
developing the functionality are already defined; ticket IGNITE-13511 [5]
summarizes the work to do.

In a nutshell, the proposed approach to configuration is as follows:

We want to declare configuration with POJO-based schemas that are concise
and contain all important information about validation and about how different
pieces of configuration relate to each other.
When schemas are marked with annotations, an annotation processor enters the
game and generates most of the boilerplate code, thus freeing users from writing
it by hand.
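
To give a flavor of the approach, a simplified schema could look like the sketch
below (annotation and field names are illustrative; see the pull requests for the
real ones):

/** Schema for a network-related configuration subtree; the annotation processor
 *  generates the configuration tree, change/view classes and the validation glue
 *  from it. Annotation and field names here are illustrative. */
@Config
public class NetworkConfigurationSchema {
    /** REST port the node listens on; validation annotations can be added on top. */
    @Value
    public int port = 10800;

    /** Number of ports to try above the base port when it is busy. */
    @Value
    public int portRange = 10;
}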

The REST API module from [2] contains an example of managing configuration and
exposing it to external tools like the Unified CLI tool presented in [6].

[1] https://github.com/apache/ignite-3/pull/5
[2] https://github.com/apache/ignite-3/pull/6
[3]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration
[4]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-63%3A+REST+API+module+to+integrate+new+modular+architecture+and+management
[5] https://issues.apache.org/jira/browse/IGNITE-13511
[6]
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Unified-CLI-tool-td50618.html


[jira] [Created] (IGNITE-13718) REST API to manage configuration

2020-11-17 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13718:


 Summary: REST API to manage configuration
 Key: IGNITE-13718
 URL: https://issues.apache.org/jira/browse/IGNITE-13718
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


The application developed in IGNITE-13712 should expose a REST API for managing 
configuration and integrate with the command-line tool prototype from IGNITE-13610.





[jira] [Created] (IGNITE-13712) Simple application integrating dynamic configuration

2020-11-17 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13712:


 Summary: Simple application integrating dynamic configuration
 Key: IGNITE-13712
 URL: https://issues.apache.org/jira/browse/IGNITE-13712
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov


Apache Ignite node and cluster configurations include many use cases of varying 
complexity.

To explore how different use cases work on top of the new dynamic configuration, a 
sample application needs to be developed.

The application should support basic as well as more complicated configurations; the 
exact list of configurations will be provided later.





Re: [DISCUSS] Ignite 3.0 development approach

2020-11-16 Thread Sergey Chugunov
Igniters,

I agree that "to create or not to create" is not the question, to rephrase
Shakespeare.

My main point is that developing new features on top of the old 2.x-style
architecture is a bad idea. We will write the code and spend some time
stabilizing it (which is expected and fine). But then, when we finally
decide to fix our architecture and pay our (already huge) technical debt,
we will have to rewrite this code again and spend time stabilizing it again.

Creating new components on top of 2.x (which is actually 1.x; nothing
fundamentally new was introduced in terms of architecture) means
wasting time now and creating more worthless work for the future.

Earlier I suggested ranking all new features according to their criticality
and the amount of breaking changes they bring, and shaping the 3.0 scope based on
this analysis. Let's get back to this idea and prepare a scope based on publicly
shared arguments.

One more thing I would add here. Our users are smart people and make
decisions about upgrading or not upgrading to a new version based on the
cost/value balance. An incremental approach keeps the cost (public API breaking
changes) high but brings a questionable amount of value with each iteration.
If we add more valuable features to 3.0 and force users to pay the cost
only once, they will be happier than if we split the really needed changes across
several major releases and send our users to the hell of endlessly rewriting
their codebases. In the latter case we'll see users become much more
reluctant to upgrade to newer versions.

Hope this makes sense.

On Mon, Nov 16, 2020 at 2:24 PM Nikolay Izhikov  wrote:

> > Let's indeed focus on Sergey's suggestions on the design->development
> approach.
>
> +1
>
> >   - API & configuration cleanup
> >   - New management tool
> >   - Schema-first approach
> >   - New replication infrastructure
>
> +1.
>
> > 16 нояб. 2020 г., в 13:40, Alexey Goncharuk 
> написал(а):
> >
> > Folks,
> >
> > I think we are overly driven away by the phrase 'new repo' rather than
> the
> > essence of my suggestion. We can keep developing in the same repo, we can
> > even keep developing in the master branch. My point is that Ignite 3.0
> is a
> > chance to move on with the architecture, so if we really want to make
> > architectural improvements, we should not strive for incremental changes
> > for *some parts of the code*.
> >
> > Maxim,
> >
> > To comment on your examples: I think that the huge effort that is
> currently
> > required to make any significant change in Ignite is the perfect example
> of
> > how we lack structure in the codebase. Yes, theoretically we can
> introduce
> > incremental changes in the code that will improve the structure, but my
> > question is: we did not do it before, what will enforce us to make these
> > changes now? With the current approach, adding a new feature increases
> the
> > test time non-linearly because without proper decoupling you have to test
> > all possible combinations of features together. We can move faster than
> > that.
> >
> > I also do not agree that we should reduce the scope of Ignite 3.0 that
> > much. I do not see how the schema-first approach can be properly and
> > reliably implemented without a reliable HA metastorage, which in turn
> > requires a reliable replication protocol to be implemented. Besides, if a
> > number of people want to work on some Ignite feature, why should they
> wait
> > because not all community members have time to review the changes?
> >
> > Let's indeed focus on Sergey's suggestions on the design->development
> > approach. I back both Nikolay's and Maxim's scope, but I think we should
> > unite them, not intersect, and the minimal list of changes to be included
> > to Ignite 3.0 is:
> >
> >   - API & configuration cleanup
> >   - New management tool
> >   - Schema-first approach
> >   - New replication infrastructure
> >
> > Any smaller subset of changes will leave Ignite 3.0 in a transient state
> > with people being too afraid to move to it because there are more major
> > breaking changes scheduled.
> >
> > пт, 13 нояб. 2020 г. в 18:28, Alexey Zinoviev :
> >
> >> I'm -1 for creating a new repo.
> >> Also I support Maxim's plan for 3.0
> >>
> >> пт, 13 нояб. 2020 г. в 15:50, Maxim Muzafarov :
> >>
> >>> Val,
> >>>
> >>>
> >>> Why *creating a new repo* is the main point we faced with? Would it be
> >>> better to discuss the components design approach and scope management
> >>> first suggested by Sergey Chugunov? I doubt that new repo will solve
> >>> move us fo

Re: [DISCUSS] Ignite 3.0 development approach

2020-11-10 Thread Sergey Chugunov
Igniters,

I thought over the ideas and concerns from the Friday meeting and summarized them in
these three points:


   1. *Components design unification approach.* The newly proposed components
   will be developed by different contributors, but they need to be unified
   and should integrate with each other easily. To ensure that, I suggest
   calling an architecture group that will create design guidelines for all
   components and a high-level overview of the overall architecture. How code is
   split into components, what the component boundaries are, how the component
   lifecycle works and what its interfaces are - all these and other questions
   should be covered.

   2. *Scope management.* Apache Ignite 3.0 should be implemented within a
   reasonable time, so we need some procedure to decide whether a particular
   feature should be dropped from the scope of 3.0 and postponed to the 3.1
   release. To do so, I suggest ranking all features by two parameters:
   criticality for 3.0 and amount of breaking changes. The 3.0 scope should
   include features of high criticality AND features with a large amount of
   breaking changes. All other features can be made optional.

   3. *Development transparency.* Development of all components should be
   made as transparent for everyone as possible. Any contributor should be
   able to look over any component at any stage of development. To achieve
   this, I suggest creating a separate public repository dedicated to 3.0
   development. It will make the code available to everyone, and when
   development of 3.0 is done we won't lose any stars of our current
   repository, as we will merge the dev repo into the main one and drop the dev repo.

Do these ideas make sense to you? Are there any concerns not covered by
these suggestions?

On Fri, Nov 6, 2020 at 7:36 PM Kseniya Romanova 
wrote:

> Here are the slides from Alexey Goncharuk. Let's think this over and
> continue on Monday:
>
> https://go.gridgain.com/rs/491-TWR-806/images/Ignite_3_Plans_and_development_process.pdf
>
> чт, 5 нояб. 2020 г. в 11:13, Anton Vinogradov :
>
> > Folks,
> >
> > Should we perform cleanup work before (r)evolutional changes?
> > My huge proposal is to get rid of things which we don't need anyway
> > - local caches,
> > - strange tx modes,
> > - code overcomplexity because of RollingUpgrade feature never attended at
> > AI,
> > - etc,
> > before choosing the way.
> >
> > On Tue, Nov 3, 2020 at 3:31 PM Valentin Kulichenko <
> > valentin.kuliche...@gmail.com> wrote:
> >
> > > Ksenia, thanks for scheduling this on such short notice!
> > >
> > > As for the original topic, I do support Alexey's idea. We're not going
> to
> > > rewrite anything from scratch, as most of the components are going to
> be
> > > moved as-is or with minimal modifications. However, the changes that
> are
> > > proposed imply serious rework of the core parts of the code, which are
> > not
> > > properly decoupled from each other and from other parts. This makes the
> > > incremental approach borderline impossible. Developing in a new repo,
> > > however, addresses this concern. As a bonus, we can also refactor the
> > code,
> > > introduce better decoupling, get rid of kernel context, and develop
> unit
> > > tests (finally!).
> > >
> > > Basically, this proposal only affects the *process*, not the set of
> > changes
> > > we had discussed before. Ignite 3.0 is our unique chance to make things
> > > right.
> > >
> > > -Val
> > >
> > > On Tue, Nov 3, 2020 at 3:06 AM Kseniya Romanova <
> > romanova.ks@gmail.com
> > > >
> > > wrote:
> > >
> > > > Pavel, all the interesting points will be anyway published here in
> > > English
> > > > (as the principal "if it's not on devlist it doesn't happened" is
> still
> > > > relevant). This is just a quick call for a group of developers. Later
> > we
> > > > can do a separate presentation of idea and discussion in English as
> we
> > > did
> > > > for the Ignite 3.0 draft of changes.
> > > >
> > > > вт, 3 нояб. 2020 г. в 13:52, Pavel Tupitsyn :
> > > >
> > > > > Kseniya,
> > > > >
> > > > > Thanks for scheduling this call.
> > > > > Do you think we can switch to English if non-Russian speaking
> > community
> > > > > members decide to join?
> > > > >
> > > > > On Tue, Nov 3, 2020 at 1:32 PM Kseniya Romanova <
> > > > romanova.ks@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Let's do this community discussion open. Here's the link on zoom
> > call
> > > > in
> > > > > > Russian for Friday 6 PM:
> > > > > >
> > https://www.meetup.com/Moscow-Apache-Ignite-Meetup/events/274360378/
> > > > > >
> > > > > > вт, 3 нояб. 2020 г. в 12:49, Nikolay Izhikov <
> nizhi...@apache.org
> > >:
> > > > > >
> > > > > > > Time works for me.
> > > > > > >
> > > > > > > > 3 нояб. 2020 г., в 12:40, Alexey Goncharuk <
> > > > > alexey.goncha...@gmail.com
> > > > > > >
> > > > > > > написал(а):
> > > > > > > >
> > > > > > > > Nikolay,
> > > > > > > >
> > > > > > > > I am up for the call. I will try to explain my reasoning in
> > > greater
> > > 

[jira] [Created] (IGNITE-13674) Document Persistent store defragmentation

2020-11-04 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13674:


 Summary: Document Persistent store defragmentation
 Key: IGNITE-13674
 URL: https://issues.apache.org/jira/browse/IGNITE-13674
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov








Re: IEP-53 Maintenance Mode: request for review

2020-10-09 Thread Sergey Chugunov
Hi Pavel,

Thanks, I looked through your comments and fixed them. Could you please
check one more time?

On Fri, Oct 9, 2020 at 10:27 AM Pavel Tupitsyn  wrote:

> Hello Sergey,
>
> I went over the public API changes briefly and left some minor comments on
> GitHub
>
> Thanks,
> Pavel
>
> On Fri, Oct 9, 2020 at 9:59 AM Sergey Chugunov 
> wrote:
>
> > Hello Igniters,
> >
> > I'm getting closer to finishing main ticket for Maintenance Mode feature
> > [1] and now working on test fixes (most likely test modifications are
> > needed).
> >
> > So I would like to ask for a review of my pull request [2] to discuss the
> > code earlier. Test status is pretty good so I expect to get a green visa
> > soon.
> >
> > Could you please take a look?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> > [2] https://github.com/apache/ignite/pull/8325
> >
>


Re: Broken test in master: BasicIndexTest

2020-10-09 Thread Sergey Chugunov
Max,

Thanks for spotting this, great catch!

Zhenya, could you please file a ticket of at least Critical priority?

On Fri, Oct 9, 2020 at 9:24 AM Zhenya Stanilovsky
 wrote:

>
>
> Thanks Maxim, the test is correct no need for removal.
> I checked 2.9 too, but looks it all ok there. I will take a look.
> >Hi, Igniters!
> >
> >I was discovering how indexes work and found a failed test.
> >BasicIndexTest#testInlineSizeChange is broken in master and it's not a
> >flaky case [1]. But it has been failing since 25/09 only.
> >
> >I discovered that it happened after the IGNITE-13207 ticket merged
> >(Checkpointer code refactoring) [2]. I'm not sure about the expected
> >behaviour of the inline index and how checkpointer affects it. But let's
> >fix it if it is a bug or completely remove this test.
> >
> >[1]
> >
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=6131871779633595667=%3Cdefault%3E=testDetails
> >
> >[2]  https://issues.apache.org/jira/browse/IGNITE-13207
> >
>


IEP-53 Maintenance Mode: request for review

2020-10-09 Thread Sergey Chugunov
Hello Igniters,

I'm getting closer to finishing the main ticket for the Maintenance Mode feature
[1] and am now working on test fixes (most likely test modifications are
needed).

So I would like to ask for a review of my pull request [2] to discuss the
code early. Test status is pretty good, so I expect to get a green visa
soon.

Could you please take a look?

[1] https://issues.apache.org/jira/browse/IGNITE-13366
[2] https://github.com/apache/ignite/pull/8325


[jira] [Created] (IGNITE-13558) GridCacheProcessor should implement better parallelization when restoring partition states on startup

2020-10-08 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13558:


 Summary: GridCacheProcessor should implement better 
parallelization when restoring partition states on startup
 Key: IGNITE-13558
 URL: https://issues.apache.org/jira/browse/IGNITE-13558
 Project: Ignite
  Issue Type: Improvement
  Components: persistence
Reporter: Sergey Chugunov
 Fix For: 2.10


The GridCacheProcessor#restorePartitionStates method tries to employ the striped pool 
to restore partition states in parallel, but the level of parallelization goes down 
only to one cache group per thread.

This is not enough and does not utilize resources effectively when one cache group is 
much bigger than the others.

We need to parallelize the restore process down to individual partitions to get the 
most from the available resources and speed up node startup.
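
A sketch of the intended change (placeholder names only, not the real 
GridCacheProcessor internals):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Sketch only: names below are placeholders, not the real GridCacheProcessor internals. */
class PartitionRestoreSketch {
    void restoreAll(List<GroupDescriptor> groups, ExecutorService stripedPool) throws Exception {
        List<Future<?>> futs = new ArrayList<>();

        // One task per partition (instead of one task per cache group), so a single
        // huge cache group no longer limits the level of parallelism.
        for (GroupDescriptor grp : groups)
            for (int partId = 0; partId < grp.partitions(); partId++) {
                int p = partId;

                futs.add(stripedPool.submit(() -> restorePartitionState(grp, p)));
            }

        // Propagate failures and wait for all partitions before finishing startup.
        for (Future<?> f : futs)
            f.get();
    }

    void restorePartitionState(GroupDescriptor grp, int partId) {
        // Reads the partition meta page and restores the partition state here.
    }

    interface GroupDescriptor {
        int partitions();
    }
}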





[jira] [Created] (IGNITE-13557) Logging improvements for PDS memory restore process

2020-10-08 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13557:


 Summary: Logging improvements for PDS memory restore process
 Key: IGNITE-13557
 URL: https://issues.apache.org/jira/browse/IGNITE-13557
 Project: Ignite
  Issue Type: Task
  Components: persistence
Reporter: Sergey Chugunov
 Fix For: 2.10


During the partition state restore phase of restoring memory state from disk, Ignite 
logs a lot of useful information at debug level but very little at info level.
In many situations more detailed information is useful for identifying performance 
issues, but printing info about all partitions is impractical as it produces too 
many logs.

The following improvements are possible though:

# To identify any imbalance between partitions and find bigger-than-average 
partitions, we should gather statistics for each partition during restore (partition 
size and the time it took to restore it). After restore we'll print information 
about the average time and the top five partitions that took the most time to restore 
(a sketch follows the list).
# To make restore progress visible, we should print a short message with 
intermediate progress information periodically. This should be applied when 
restore starts taking too long (e.g. if restore hasn't finished in 5 
minutes, start printing progress each minute).
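
A sketch of the first improvement, gathering per-partition timings and reporting 
the slowest ones (a hypothetical helper, not the actual logging code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Collects per-partition restore timings and reports the slowest ones (sketch only). */
class RestoreStats {
    private final Map<String, Long> restoreTimeMs = new ConcurrentHashMap<>();

    void onPartitionRestored(String partId, long durationMs) {
        restoreTimeMs.put(partId, durationMs);
    }

    /** Builds a single info-level line: average restore time plus the topN slowest partitions. */
    String report(int topN) {
        double avg = restoreTimeMs.values().stream().mapToLong(Long::longValue).average().orElse(0);

        StringBuilder sb = new StringBuilder("Partition states restored, avg time [ms]=" + avg);

        restoreTimeMs.entrySet().stream()
            .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
            .limit(topN)
            .forEach(e -> sb.append(", ").append(e.getKey()).append('=').append(e.getValue()));

        return sb.toString();
    }
}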





[jira] [Created] (IGNITE-13550) CLI command to execute maintenance action in corrupted PDS scenario

2020-10-07 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13550:


 Summary: CLI command to execute maintenance action in corrupted 
PDS scenario
 Key: IGNITE-13550
 URL: https://issues.apache.org/jira/browse/IGNITE-13550
 Project: Ignite
  Issue Type: Task
  Components: control.sh
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.10


IGNITE-13366 introduces Maintenance Mode for the corrupted PDS scenario and changes 
the previous behavior of automatically deleting corrupted PDS files.

A new command is needed so the user is able to get information about the maintenance 
task and trigger the needed action.





[DISCUSSION] User-facing API for managing Maintenance Mode

2020-09-29 Thread Sergey Chugunov
Hello Ignite dev community,

As the internal implementation of Maintenance Mode [1] is getting closer to
the finish line, I want to discuss one more thing: the user-facing API (I will use
the control utility for examples) for managing it.

What should be managed?
When a node enters MM, it may start some automatic actions (like
defragmentation) or wait for a user to intervene and resolve the issue
(as in the case of PDS corruption).

So for manually triggered operations like PDS cleanup after corruption we
should provide the user with a way to actually trigger the operation.
And for long-running automatic operations like defragmentation, actions like
status and cancel are reasonable to implement.

At the same time Maintenance Mode is a supporting feature; it doesn't bring
any value by itself but enables the implementation of other features.
Thus putting it at the center of the API and building all commands around the main
"maintenance" command may not be right.

There are two alternatives - "*Big features deserve their own commands*"
and "*Everything should be unified*". Let's consider them.

Big features deserve their own commands
Here, for each big feature we implement its own command. Defragmentation is
a big separate feature, so why shouldn't it have its own commands to request
or cancel it?

Examples
*control.sh defragmentation request-for-node --nodeId 
[--caches ]* - defragmentation will be started on the
particular node after its restart.
*control.sh defragmentation status* - prints information about status
of on-going defragmentation.
*control.sh defragmentation cancel* - cancels on-going defragmentation.

Another command - "maintenance" - will be used for more generic purposes.

Examples
*control.sh maintenance list-records* - prints information about each
maintenance record (id and name of the record, parameters, description,
current status).
*control.sh maintenance record-actions --id * - prints
information about user-triggered actions available for this record (e.g.
for pds corruption record it may be "clean-corrupted-files")
*control.sh maintenance execute-action --id  --action-name
* - triggers execution of particular action and prints results.

*Pros:*

   1. Big features like defragmentation get their own commands and more
   freedom in implementing them.
   2. It is emphasized that maintenance mode is just a supporting thing and
   not a first-class feature (it is not at the center of API).

*Cons:*

   1. Duplication of functionality: the same functions may be available via
   the general maintenance command and via the feature's own command.
   2. Information about a feature may be split between two commands: one piece
   of information is available in the "feature" command, another in the
   "maintenance" command.


Everything should be unified
We can go the other way and gather all features that rely on MM under one
unified command.

The API for a node that is already in MM looks complete, logical and very
intuitive:
*control.sh maintenance list-records* - output all records that have to
be resolved to finish maintenance.
*control.sh maintenance record-actions --id * - all actions
available for the record.
*control.sh maintenance execute-action --id  --action-name
* - executes action of the given name (like general actions
"status" or "delete" and more specific action "clean-corrupted-files" for
corrupted pds situation).

But the API to request that a node enter maintenance mode becomes more vague.
*control.sh maintenance available-operations* - prints all operations
available to request (for instance, defragmentation).
control.sh maintenance request-operation --id  --params
 - requests the given operation to start on the next node
restart.
Here we have to distinguish operations that are requested automatically
(like PDS corruption) and not show them to the user.

*Pros:*

   1. Single API to get information and trigger actions without any
   duplication.


*Cons:*

   1. We restrict big features to the model provided by the maintenance command.
   2. In this API we put maintenance at the center although it is nothing
   more than a supporting feature.
   3. The API to request maintenance operations doesn't feel intuitive to me
   but rather artificial.


So what do you think? What looks better and more intuitive from your
perspective?

I will be glad to hear any feedback on the subject.

As a result of this discussion I will create a ticket for implementation
and include it into IEP-53 [2]

[1] https://issues.apache.org/jira/browse/IGNITE-13366
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode


Re: [DISCUSSION] Maintenance Mode feature

2020-09-29 Thread Sergey Chugunov
Hello Nikolay,

> AFAIKU There is third use-case for this mode.

Sorry for the late reply.

I took a look at the code, and Maintenance Mode indeed looks like a good match
for the master key change situation.

I want to clarify only one thing. In the current implementation we pass the new
master key name via a system property. Do you think of getting rid of this
property and passing the new master key name to the encryption manager with the
maintenance parameters? In terms of the original IEP these are parameters passed
with the MaintenanceRecord.

--
Thanks!

On Mon, Sep 21, 2020 at 3:20 PM Nikolay Izhikov  wrote:

> Hello, Sergey.
>
> > At the moment I'm aware about two use cases for this feature: corrupted
> PDS cleanup and defragmentation.
>
> AFAIKU There is third use-case for this mode.
>
> Change encryption master key in case node was down during cluster master
> key change.
> In this case, node can’t join to the cluster, because it’s master key
> differs from the cluster.
> To recover node Ignite should locally change master key before join.
>
> Please, take a look into source code [1]
>
> [1]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
>
> > 21 сент. 2020 г., в 14:37, Sergey Chugunov 
> написал(а):
> >
> > Ivan,
> >
> > Sorry for some confusion, MM indeed is not a normal mode. What I was
> trying
> > to say is that when in MM node still starts and allows the user to
> perform
> > actions with it like sending commands via control utility/JMX APIs or
> > reading metrics.
> >
> > This is the key point: although the node is not in the cluster but it is
> > still alive can be monitored and supports management to do maintenance.
> >
> > From  the code complexity perspective I'm trying to design the feature in
> > such a way that all maintenance code is as encapsulated as possible and
> > avoids massive interventions into main workflows of components.
> > At the moment I'm aware about two use cases for this feature: corrupted
> PDS
> > cleanup and defragmentation. As far as I know it won't bring too much
> > complexity in both cases.
> >
> > I cannot say for other components but I believe it will be possible to
> > integrate MM feature into their workflow as well with reasonable amount
> of
> > refactoring.
> >
> > Does it make sense to you?
> >
> > On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
> wrote:
> >
> >> Sergey,
> >>
> >> Thank you for your answer!
> >>
> >> Might be I am looking at the subject from a different angle.
> >>
> >>> I think of a node in MM as an almost normal one
> >> I cannot think of such a mode as a normal one, because it apparently
> >> does not perform usual cluster node functions. It is not a part of a
> >> cluster, caches data is not available, Discovery and Communication are
> >> not needed.
> >>
> >> I fear that with "node started in a special mode" approach we will get
> >> an additional flag in the code making the code more complex and
> >> fragile. Should not I worry about it?
> >>
> >> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov  >:
> >>> Vladislav, Ivan,
> >>>
> >>> Thank you for your questions and suggestions. Let me answer them.
> >>>
> >>> Vladislav,
> >>>
> >>> If I understood you correctly, you're talking about a node performing
> >> some
> >>> automatic actions to fix the problem and then join the cluster as
> usual.
> >>>
> >>> However the original ticket [1] where we faced the need for Maintenance
> >>> Mode is about exactly the opposite: avoid doing automatic actions and
> >> give
> >>> a user the ability to decide what to do.
> >>>
> >>> Also the idea of Maintenance Mode is that the node is able to accept
> >>> commands, expose metrics and so on, thus we need all components to be
> >>> initialized (some of them may be partially initialized due to their own
> >>> maintenance).
> >>> To achieve that we need to go through a full cycle of node
> initialization
> >>> including discovery initialization. When discovery is initialized (in
> >>> special isolated mode) I don't think it is easy to switch back to
> normal
> >>> operations without a restart.
> >>>
> >>> Ivan,
> >>>
> >>> I think of a node in MM as an almost normal one (maybe with some
> >> components

Re: [DISCUSSION] Maintenance Mode feature

2020-09-23 Thread Sergey Chugunov
Ivan,

If you come up with any ideas that may make this feature better, don't
hesitate to share them!

Thank you!

On Tue, Sep 22, 2020 at 11:27 AM Ivan Pavlukhin  wrote:

> Sergey,
>
> Thank you for your answer. While I am not happy with the proposed
> approach but things never were easy. Unfortunately I cannot suggest
> 100% better approaches so far. So, I should trust your vision.
>
> 2020-09-22 10:29 GMT+03:00, Sergey Chugunov :
> > Ivan,
> >
> > Checkpointer in Maintenance Mode is started and allows normal operations
> as
> > it may be needed for defragmentation and possibly other cases.
> >
> > Discovery is started with a special implementation of SPI that doesn't
> make
> > attempts to seek and/or connect to the rest of the cluster. From that
> > perspective node in MM is totally isolated.
> >
> > Communication is started as usual but I believe it doesn't matter as
> > discovery no other nodes are observed in topology and connection attempt
> > should not happen. But it may make sense to implement isolated version of
> > communication SPI as well to have 100% guarantee that no communication
> with
> > other nodes will happen.
> >
> > It is important to note that GridRestProcessor is started normally as we
> > need it to connect to the node via control utility.
> >
> > On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin 
> wrote:
> >
> >> Sergey,
> >>
> >> > From  the code complexity perspective I'm trying to design the feature
> >> in such a way that all maintenance code is as encapsulated as possible
> >> and
> >> avoids massive interventions into main workflows of components.
> >>
> >> Could please briefly tell what means do you use to achieve
> >> encapsulation? Are Discovery, Communication, Checkpointer and other
> >> components started in a maintenance mode in current design?
> >>
> >> 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov :
> >> > Hello, Sergey.
> >> >
> >> >> At the moment I'm aware about two use cases for this feature:
> >> >> corrupted
> >> >> PDS cleanup and defragmentation.
> >> >
> >> > AFAIKU There is third use-case for this mode.
> >> >
> >> > Change encryption master key in case node was down during cluster
> >> > master
> >> key
> >> > change.
> >> > In this case, node can’t join to the cluster, because it’s master key
> >> > differs from the cluster.
> >> > To recover node Ignite should locally change master key before join.
> >> >
> >> > Please, take a look into source code [1]
> >> >
> >> > [1]
> >> >
> >>
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
> >> >
> >> >> 21 сент. 2020 г., в 14:37, Sergey Chugunov <
> sergey.chugu...@gmail.com>
> >> >> написал(а):
> >> >>
> >> >> Ivan,
> >> >>
> >> >> Sorry for some confusion, MM indeed is not a normal mode. What I was
> >> >> trying
> >> >> to say is that when in MM node still starts and allows the user to
> >> >> perform
> >> >> actions with it like sending commands via control utility/JMX APIs or
> >> >> reading metrics.
> >> >>
> >> >> This is the key point: although the node is not in the cluster but it
> >> >> is
> >> >> still alive can be monitored and supports management to do
> >> >> maintenance.
> >> >>
> >> >> From  the code complexity perspective I'm trying to design the
> feature
> >> in
> >> >> such a way that all maintenance code is as encapsulated as possible
> >> >> and
> >> >> avoids massive interventions into main workflows of components.
> >> >> At the moment I'm aware about two use cases for this feature:
> >> >> corrupted
> >> >> PDS
> >> >> cleanup and defragmentation. As far as I know it won't bring too much
> >> >> complexity in both cases.
> >> >>
> >> >> I cannot say for other components but I believe it will be possible
> to
> >> >> integrate MM feature into their workflow as well with reasonable
> >> >> amount
> >> >> of
> >> >> refactoring.
> >> >>
> >> >

Re: [DISCUSSION] Maintenance Mode feature

2020-09-22 Thread Sergey Chugunov
Ivan,

The checkpointer in Maintenance Mode is started and allows normal operations, as
it may be needed for defragmentation and possibly other cases.

Discovery is started with a special implementation of the SPI that doesn't make
attempts to seek and/or connect to the rest of the cluster. From that
perspective a node in MM is totally isolated.

Communication is started as usual, but I believe it doesn't matter: since
discovery observes no other nodes in topology, a connection attempt
should not happen. But it may make sense to implement an isolated version of
the Communication SPI as well to have a 100% guarantee that no communication with
other nodes will happen.

It is important to note that GridRestProcessor is started normally, as we
need it to connect to the node via the control utility.

On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin  wrote:

> Sergey,
>
> > From  the code complexity perspective I'm trying to design the feature
> in such a way that all maintenance code is as encapsulated as possible and
> avoids massive interventions into main workflows of components.
>
> Could please briefly tell what means do you use to achieve
> encapsulation? Are Discovery, Communication, Checkpointer and other
> components started in a maintenance mode in current design?
>
> 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov :
> > Hello, Sergey.
> >
> >> At the moment I'm aware about two use cases for this feature: corrupted
> >> PDS cleanup and defragmentation.
> >
> > AFAIKU There is third use-case for this mode.
> >
> > Change encryption master key in case node was down during cluster master
> key
> > change.
> > In this case, node can’t join to the cluster, because it’s master key
> > differs from the cluster.
> > To recover node Ignite should locally change master key before join.
> >
> > Please, take a look into source code [1]
> >
> > [1]
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
> >
> >> 21 сент. 2020 г., в 14:37, Sergey Chugunov 
> >> написал(а):
> >>
> >> Ivan,
> >>
> >> Sorry for some confusion, MM indeed is not a normal mode. What I was
> >> trying
> >> to say is that when in MM node still starts and allows the user to
> >> perform
> >> actions with it like sending commands via control utility/JMX APIs or
> >> reading metrics.
> >>
> >> This is the key point: although the node is not in the cluster but it is
> >> still alive can be monitored and supports management to do maintenance.
> >>
> >> From  the code complexity perspective I'm trying to design the feature
> in
> >> such a way that all maintenance code is as encapsulated as possible and
> >> avoids massive interventions into main workflows of components.
> >> At the moment I'm aware about two use cases for this feature: corrupted
> >> PDS
> >> cleanup and defragmentation. As far as I know it won't bring too much
> >> complexity in both cases.
> >>
> >> I cannot say for other components but I believe it will be possible to
> >> integrate MM feature into their workflow as well with reasonable amount
> >> of
> >> refactoring.
> >>
> >> Does it make sense to you?
> >>
> >> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin 
> >> wrote:
> >>
> >>> Sergey,
> >>>
> >>> Thank you for your answer!
> >>>
> >>> Might be I am looking at the subject from a different angle.
> >>>
> >>>> I think of a node in MM as an almost normal one
> >>> I cannot think of such a mode as a normal one, because it apparently
> >>> does not perform usual cluster node functions. It is not a part of a
> >>> cluster, caches data is not available, Discovery and Communication are
> >>> not needed.
> >>>
> >>> I fear that with "node started in a special mode" approach we will get
> >>> an additional flag in the code making the code more complex and
> >>> fragile. Should not I worry about it?
> >>>
> >>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov  >:
> >>>> Vladislav, Ivan,
> >>>>
> >>>> Thank you for your questions and suggestions. Let me answer them.
> >>>>
> >>>> Vladislav,
> >>>>
> >>>> If I understood you correctly, you're talking about a node performing
> >>> some
> >>>> automatic actions to fix th

Re: [DISCUSSION] Maintenance Mode feature

2020-09-21 Thread Sergey Chugunov
Ivan,

Sorry for some confusion, MM indeed is not a normal mode. What I was trying
to say is that when in MM node still starts and allows the user to perform
actions with it like sending commands via control utility/JMX APIs or
reading metrics.

This is the key point: although the node is not in the cluster, it is still
alive, can be monitored, and supports the management operations needed to do
maintenance.

From the code complexity perspective I'm trying to design the feature in
such a way that all maintenance code is as encapsulated as possible and
avoids massive interventions into the main workflows of components.
At the moment I'm aware of two use cases for this feature: corrupted PDS
cleanup and defragmentation. As far as I know it won't bring too much
complexity in either case.

I cannot say for other components, but I believe it will be possible to
integrate the MM feature into their workflows as well with a reasonable
amount of refactoring.

Does it make sense to you?

On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin  wrote:

> Sergey,
>
> Thank you for your answer!
>
> Might be I am looking at the subject from a different angle.
>
> > I think of a node in MM as an almost normal one
> I cannot think of such a mode as a normal one, because it apparently
> does not perform usual cluster node functions. It is not a part of a
> cluster, caches data is not available, Discovery and Communication are
> not needed.
>
> I fear that with "node started in a special mode" approach we will get
> an additional flag in the code making the code more complex and
> fragile. Should not I worry about it?
>
> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov :
> > Vladislav, Ivan,
> >
> > Thank you for your questions and suggestions. Let me answer them.
> >
> > Vladislav,
> >
> > If I understood you correctly, you're talking about a node performing
> some
> > automatic actions to fix the problem and then join the cluster as usual.
> >
> > However the original ticket [1] where we faced the need for Maintenance
> > Mode is about exactly the opposite: avoid doing automatic actions and
> give
> > a user the ability to decide what to do.
> >
> > Also the idea of Maintenance Mode is that the node is able to accept
> > commands, expose metrics and so on, thus we need all components to be
> > initialized (some of them may be partially initialized due to their own
> > maintenance).
> > To achieve that we need to go through a full cycle of node initialization
> > including discovery initialization. When discovery is initialized (in
> > special isolated mode) I don't think it is easy to switch back to normal
> > operations without a restart.
> >
> > Ivan,
> >
> > I think of a node in MM as an almost normal one (maybe with some
> components
> > skipped some steps of their initialization). Commands are accepted,
> > appropriate metrics are exposed e.g. through JMX API and so on.
> >
> > So as I see it we'll have special commands for control.{sh|bat} CLI
> > allowing user to see reasons why node switched to maintenance mode and/or
> > trigger actions to fix the problem (I'm still thinking about proper
> design
> > of these actions though).
> >
> > Of course the user should also be able to fix the problem manually e.g.
> by
> > manually deleting corrupted PDS files when node is down. Ideally
> > Maintenance Mode should be smart enough to figure that out and switch to
> > normal operations without a restart but I'm not sure if it is possible
> > without invasive changes of our components' lifecycle.
> > So I believe this model (node truly started in Maintenance Mode and new
> > commands in control.{sh|bat}) is a good fit for our current APIs and ways
> > to interact with the node.
> >
> > Does it sound reasonable to you?
> >
> > Thank you!
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> >
> > On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin 
> wrote:
> >
> >> Sergey,
> >>
> >> Actually, I missed the point that the discussed mode affects a single
> >> node but not a whole cluster. Perhaps I mixed terms "mode" and
> >> "state".
> >>
> >> My next thoughts about maintenance routines are about special
> >> utilities. As far as I remember MySQL provides a bunch of scripts for
> >> various maintenance purposes. What user interface for maintenance
> >> tasks execution is assumed? And what do we mean by "starting" a node
> >> in a maintenance mode? Can we do some routines without "starting"
> >> (e.g. try to recover PDS or cleanup)?
>

Re: [DISCUSSION] Maintenance Mode feature

2020-09-02 Thread Sergey Chugunov
Vladislav, Ivan,

Thank you for your questions and suggestions. Let me answer them.

Vladislav,

If I understood you correctly, you're talking about a node performing some
automatic actions to fix the problem and then join the cluster as usual.

However the original ticket [1] where we faced the need for Maintenance
Mode is about exactly the opposite: avoid doing automatic actions and give
a user the ability to decide what to do.

Also the idea of Maintenance Mode is that the node is able to accept
commands, expose metrics and so on, thus we need all components to be
initialized (some of them may be partially initialized due to their own
maintenance).
To achieve that we need to go through a full cycle of node initialization
including discovery initialization. When discovery is initialized (in
special isolated mode) I don't think it is easy to switch back to normal
operations without a restart.

Ivan,

I think of a node in MM as an almost normal one (maybe with some components
skipping some steps of their initialization). Commands are accepted,
appropriate metrics are exposed e.g. through the JMX API, and so on.

So as I see it we'll have special commands in the control.{sh|bat} CLI
allowing the user to see the reasons why the node switched to maintenance
mode and/or trigger actions to fix the problem (I'm still thinking about the
proper design of these actions though).

Of course the user should also be able to fix the problem manually e.g. by
manually deleting corrupted PDS files when node is down. Ideally
Maintenance Mode should be smart enough to figure that out and switch to
normal operations without a restart but I'm not sure if it is possible
without invasive changes of our components' lifecycle.
So I believe this model (node truly started in Maintenance Mode and new
commands in control.{sh|bat}) is a good fit for our current APIs and ways
to interact with the node.

Does it sound reasonable to you?

Thank you!

[1] https://issues.apache.org/jira/browse/IGNITE-13366

On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin  wrote:

> Sergey,
>
> Actually, I missed the point that the discussed mode affects a single
> node but not a whole cluster. Perhaps I mixed terms "mode" and
> "state".
>
> My next thoughts about maintenance routines are about special
> utilities. As far as I remember MySQL provides a bunch of scripts for
> various maintenance purposes. What user interface for maintenance
> tasks execution is assumed? And what do we mean by "starting" a node
> in a maintenance mode? Can we do some routines without "starting"
> (e.g. try to recover PDS or cleanup)?
>
> 2020-08-31 23:41 GMT+03:00, Vladislav Pyatkov :
> > Hi Sergey.
> >
> > As I understand any switching from/to MM possible only through manual
> > restart a node.
> > But in your example that look like a technical actions, that only
> possible
> > in the case.
> > Do you plan to provide a possibility for client where he can make a
> > decision without a manual intervention?
> >
> > For example: Start node and manually agree with an option and after
> > automatically resolve conflict and back to topology as a stable node.
> >
> > On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov <
> sergey.chugu...@gmail.com>
> > wrote:
> >
> >> Hello Ivan,
> >>
> >> Thank you for raising the good question, I didn't think of Maintenance
> >> Mode
> >> from that perspective.
> >>
> >> In short, Maintenance Mode isn't related to Cluster States concept.
> >> According to javadoc documentation of ClusterState enum [1] it is solely
> >> about cache operations and to some extent doesn't affect other
> components
> >> of Ignite node.
> >> From APIs perspective putting the methods to manage Cluster State to
> >> IgniteCluster interface doesn't look ideal to me but it is as it is.
> >>
> >> On the other hand Maintenance Mode as I see it will be managed through
> >> different APIs than a ClusterState and this difference definitely will
> be
> >> reflected in the documentation of the feature.
> >>
> >> Ignite node is a complex piece of many components interacting with each
> >> other, they may have different lifecycles and states; states of
> different
> >> components cannot be reduced to the lowest common denominator.
> >>
> >> However if you have an idea of how to call the feature better to let the
> >> user easier distinguish it from other similar features please share it
> >> with
> >> us. Personally I'm very welcome to any suggestions that make design more
> >> intuitive and easy-to-use.
> >>
> >> Thanks!
> >>
> >> [1]
> >>
> 

Re: [DISCUSSION] Maintenance Mode feature

2020-08-31 Thread Sergey Chugunov
Hello Ivan,

Thank you for raising the good question, I didn't think of Maintenance Mode
from that perspective.

In short, Maintenance Mode isn't related to Cluster States concept.
According to the javadoc documentation of the ClusterState enum [1], it is
solely about cache operations and to some extent doesn't affect other
components of an Ignite node.
From an API perspective, putting the methods that manage Cluster State into
the IgniteCluster interface doesn't look ideal to me, but it is what it is.

On the other hand Maintenance Mode as I see it will be managed through
different APIs than a ClusterState and this difference definitely will be
reflected in the documentation of the feature.

An Ignite node is a complex composition of many components interacting with
each other; they may have different lifecycles and states, and the states of
different components cannot be reduced to the lowest common denominator.

However, if you have an idea of how to name the feature better so that the
user can more easily distinguish it from other similar features, please share
it with us. Personally, I very much welcome any suggestions that make the
design more intuitive and easy to use.

Thanks!

[1]
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java

On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin  wrote:

> Hi Sergey,
>
> Thank you for bringing attention to that important subject!
>
> My note here is about one more cluster mode. As far as I know
> currently we already have 3 modes (inactive, read-only, read-write)
> and the subject is about one more. From the first glance it could be
> hard for a user to understand and use all modes properly. Do we really
> need all spectrum? Could we simplify things somehow?
>
> 2020-08-27 15:59 GMT+03:00, Sergey Chugunov :
> > Hello Nikolay,
> >
> > Created one, available by link [1]
> >
> > Initially there was an intention to develop it under IEP-47 [2] and there
> > is even a separate section for Maintenance Mode there.
> > But it looks like this feature is useful in more cases and deserves its
> own
> > IEP.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
> > [2]
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> >
> > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov 
> > wrote:
> >
> >> Hello, Sergey!
> >>
> >> Thanks for the proposal.
> >> Let’s have IEP for this feature.
> >>
> >> > 27 авг. 2020 г., в 10:25, Sergey Chugunov 
> >> написал(а):
> >> >
> >> > Hello Igniters,
> >> >
> >> > I want to start a discussion about new supporting feature that could
> be
> >> > very useful in many scenarios where persistent storage is involved:
> >> > Maintenance Mode.
> >> >
> >> > *Summary*
> >> > Maintenance Mode (MM for short) is a special state of Ignite node when
> >> node
> >> > doesn't serve user requests nor joins the cluster but waits for user
> >> > commands or performs automatic actions for maintenance purposes.
> >> >
> >> > *Motivation*
> >> > There are situations when node cannot participate in regular
> operations
> >> but
> >> > at the same time should not be shut down.
> >> >
> >> > One example is a ticket [1] where I developed the first draft of
> >> > Maintenance Mode.
> >> > Here we get into a situation when node has potentially corrupted PDS
> >> > thus
> >> > cannot proceed with restore routine and join the cluster as usual.
> >> > At the same time node should not fail nor be stopped for manual
> >> > cleanup.
> >> > Manual cleanup is not always an option (e.g. restricted access to file
> >> > system); in managed environments failed node will be restarted
> >> > automatically so user won't have time for performing necessary
> >> operations.
> >> > Thus node needs to function in a special mode allowing user to connect
> >> > to
> >> > it and perform necessary actions.
> >> >
> >> > Another example is described in IEP-47 [2] where defragmentation is
> >> > being
> >> > developed. Node defragmenting its PDS should not join the cluster
> until
> >> the
> >> > process is finished so it needs to enter Maintenance Mode as well.
> >> >
> >> > *Suggested design*
> >> > I suggest MM to work as follows:
> >> > 1. Node enters MM if special markers are found on disk. These

Re: [DISCUSSION] Maintenance Mode feature

2020-08-27 Thread Sergey Chugunov
Hello Nikolay,

Created one, available at [1]

Initially there was an intention to develop it under IEP-47 [2] and there
is even a separate section for Maintenance Mode there.
But it looks like this feature is useful in more cases and deserves its own
IEP.

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation

On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov 
wrote:

> Hello, Sergey!
>
> Thanks for the proposal.
> Let’s have IEP for this feature.
>
> > 27 авг. 2020 г., в 10:25, Sergey Chugunov 
> написал(а):
> >
> > Hello Igniters,
> >
> > I want to start a discussion about new supporting feature that could be
> > very useful in many scenarios where persistent storage is involved:
> > Maintenance Mode.
> >
> > *Summary*
> > Maintenance Mode (MM for short) is a special state of Ignite node when
> node
> > doesn't serve user requests nor joins the cluster but waits for user
> > commands or performs automatic actions for maintenance purposes.
> >
> > *Motivation*
> > There are situations when node cannot participate in regular operations
> but
> > at the same time should not be shut down.
> >
> > One example is a ticket [1] where I developed the first draft of
> > Maintenance Mode.
> > Here we get into a situation when node has potentially corrupted PDS thus
> > cannot proceed with restore routine and join the cluster as usual.
> > At the same time node should not fail nor be stopped for manual cleanup.
> > Manual cleanup is not always an option (e.g. restricted access to file
> > system); in managed environments failed node will be restarted
> > automatically so user won't have time for performing necessary
> operations.
> > Thus node needs to function in a special mode allowing user to connect to
> > it and perform necessary actions.
> >
> > Another example is described in IEP-47 [2] where defragmentation is being
> > developed. Node defragmenting its PDS should not join the cluster until
> the
> > process is finished so it needs to enter Maintenance Mode as well.
> >
> > *Suggested design*
> > I suggest MM to work as follows:
> > 1. Node enters MM if special markers are found on disk. These markers
> > called Maintenance Records could be created automatically (e.g. when
> > storage component detects corrupted storage) or by user request (when
> user
> > requests defragmentation of some caches). So entering MM requires node
> > restart.
> > 2. Started in MM node doesn't join the cluster but finishes startup
> routine
> > so it is able to receive commands and provide metrics to the user.
> > 3. When all necessary maintenance operations are finished, Maintenance
> > Records for these operations are deleted from disk and node restarted
> again
> > to enter normal service.
> >
> > *Example*
> > To put it into a context let's consider an example of how I see the MM
> > workflow in case of PDS corruption.
> >
> >   1. Node has failed in the middle of checkpoint when WAL is disabled for
> >   a particular cache -> data files of the cache are potentially
> corrupted.
> >   2. On next startup node detects this situation, creates Maintenance
> >   Record on disk and shuts down.
> >   3. On next startup node sees Maintenance Record, enters Maintenance
> Mode
> >   and waits for user to do specific actions: clean potentially corrupted
> PDS.
> >   4. When user has done necessary actions he/she removes Maintenance
> >   Record using Maintenance Mode API exposed via control.{sh|bat} script
> or
> >   JMX.
> >   5. On next startup node goes to normal operations as maintenance reason
> >   is fixed.
> >
> >
> > I prepared a PR [3] for ticket [1] with draft implementation. It is not
> > ready to be merged to master branch but is already fully functional and
> can
> > be reviewed.
> >
> > Hope you'll share your feedback on the feature and/or any thoughts on
> > implementation.
> >
> > Thank you!
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> > [2]
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > [3] https://github.com/apache/ignite/pull/8189
>
>


[DISCUSSION] Maintenance Mode feature

2020-08-27 Thread Sergey Chugunov
Hello Igniters,

I want to start a discussion about a new supporting feature that could be
very useful in many scenarios where persistent storage is involved:
Maintenance Mode.

*Summary*
Maintenance Mode (MM for short) is a special state of an Ignite node in which
the node doesn't serve user requests or join the cluster, but waits for user
commands or performs automatic actions for maintenance purposes.

*Motivation*
There are situations when a node cannot participate in regular operations but
at the same time should not be shut down.

One example is a ticket [1] where I developed the first draft of
Maintenance Mode.
Here we get into a situation where the node has a potentially corrupted PDS
and thus cannot proceed with the restore routine and join the cluster as usual.
At the same time the node should not fail nor be stopped for manual cleanup.
Manual cleanup is not always an option (e.g. restricted access to the file
system); in managed environments a failed node will be restarted
automatically, so the user won't have time to perform the necessary operations.
Thus the node needs to function in a special mode allowing the user to connect
to it and perform the necessary actions.

Another example is described in IEP-47 [2] where defragmentation is being
developed. Node defragmenting its PDS should not join the cluster until the
process is finished so it needs to enter Maintenance Mode as well.

*Suggested design*
I suggest MM works as follows:
1. A node enters MM if special markers are found on disk. These markers,
called Maintenance Records, could be created automatically (e.g. when the
storage component detects corrupted storage) or by user request (when the user
requests defragmentation of some caches). So entering MM requires a node
restart.
2. Started in MM, the node doesn't join the cluster but finishes the startup
routine so it is able to receive commands and provide metrics to the user.
3. When all necessary maintenance operations are finished, the Maintenance
Records for these operations are deleted from disk and the node is restarted
again to enter normal service.

*Example*
To put it into a context let's consider an example of how I see the MM
workflow in case of PDS corruption.

   1. The node fails in the middle of a checkpoint while WAL is disabled for
   a particular cache -> data files of the cache are potentially corrupted.
   2. On the next startup the node detects this situation, creates a
   Maintenance Record on disk and shuts down.
   3. On the next startup the node sees the Maintenance Record, enters
   Maintenance Mode and waits for the user to take specific actions: clean the
   potentially corrupted PDS.
   4. When the user has done the necessary actions he/she removes the
   Maintenance Record using the Maintenance Mode API exposed via the
   control.{sh|bat} script or JMX.
   5. On the next startup the node resumes normal operations as the
   maintenance reason is fixed.
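
To make the marker idea above concrete, here is a rough sketch of a
Maintenance Record kept as a small file in the node's work directory (the file
name, location and format are illustrative only, not the actual
implementation):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class MaintenanceRecordSketch {
        private static final Path RECORD = Paths.get("work", "maintenance", "corrupted-pds.record");

        /** Created by the component that detects a problem (step 2 of the example). */
        static void createRecord(String reason) throws IOException {
            Files.createDirectories(RECORD.getParent());
            Files.write(RECORD, reason.getBytes(StandardCharsets.UTF_8));
        }

        /** Checked early on startup to decide whether to enter Maintenance Mode (step 3). */
        static boolean maintenanceRequired() {
            return Files.exists(RECORD);
        }

        /** Invoked via the control.{sh|bat} / JMX API once the user has fixed the problem (step 4). */
        static void removeRecord() throws IOException {
            Files.deleteIfExists(RECORD);
        }
    }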


I prepared a PR [3] for ticket [1] with draft implementation. It is not
ready to be merged to master branch but is already fully functional and can
be reviewed.

Hope you'll share your feedback on the feature and/or any thoughts on
implementation.

Thank you!

[1] https://issues.apache.org/jira/browse/IGNITE-13366
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
[3] https://github.com/apache/ignite/pull/8189


[jira] [Created] (IGNITE-13367) meta --remove command usage improvements

2020-08-18 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13367:


 Summary: meta --remove command usage improvements
 Key: IGNITE-13367
 URL: https://issues.apache.org/jira/browse/IGNITE-13367
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.10


Command for removing metadata has the following issues:
# In the 'Type not found' scenario it prints long stack traces to the console 
instead of short information about the requested type.
# When used, it registers some internal classes which are not supposed to go 
through the binary metadata registration protocol.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13366) Prohibit unconditional automatic deletion of data files if WAL was disabled prior to node's shutdown

2020-08-17 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13366:


 Summary: Prohibit unconditional automatic deletion of data files 
if WAL was disabled prior to node's shutdown
 Key: IGNITE-13366
 URL: https://issues.apache.org/jira/browse/IGNITE-13366
 Project: Ignite
  Issue Type: Task
  Components: persistence
Affects Versions: 2.8.1
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.10


If a node with persistence is stopped while WAL was disabled for a cache (no 
matter whether because of rebalancing in progress or by explicit user request), 
on the next node start all data files of that cache are removed automatically 
and unconditionally.

This behavior may be unexpected for users, as they may not understand all the 
consequences of disabling WAL locally (for rebalancing) or globally (via an 
IgniteCluster API call). It is also not smart enough, as there is no point in 
deleting consistent data files.

We should change this behavior to the following: no automatic deletions 
whatsoever. If the data files are consistent (equivalent to: no checkpoint was 
running when the node was stopped), start up normally. If the data files are 
corrupted, don't let the node start.
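
A tiny sketch of the proposed startup decision (the flags are illustrative 
stand-ins, not actual Ignite code):

    class WalDisabledStartupCheckSketch {
        /** Decides whether the node may start, instead of silently deleting data files. */
        static void checkOnStart(boolean walWasDisabledForCache, boolean checkpointWasInterrupted) {
            if (!walWasDisabledForCache)
                return; // normal start, nothing to check

            if (!checkpointWasInterrupted)
                return; // data files are consistent: start normally and keep the files

            // Potentially corrupted PDS: refuse to start and let the user decide what to do.
            throw new IllegalStateException("Cache data files may be corrupted: WAL was " +
                "disabled and a checkpoint was interrupted on the previous stop.");
        }
    }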



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13260) Improve javadoc documentation for FilePageStore abstraction.

2020-07-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13260:


 Summary: Improve javadoc documentation for FilePageStore 
abstraction.
 Key: IGNITE-13260
 URL: https://issues.apache.org/jira/browse/IGNITE-13260
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov
 Fix For: 2.10


FilePageStore class javadoc comment doesn't provide any useful information 
about role of this important class in the whole picture of Ignite Native 
Persistence.

We need to add information about responsibilities of the class and its 
relationships with other classes in Ignite Persistence module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13239) Document APIs to view and change Cluster ID and Tag

2020-07-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13239:


 Summary: Document APIs to view and change Cluster ID and Tag
 Key: IGNITE-13239
 URL: https://issues.apache.org/jira/browse/IGNITE-13239
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov


In IGNITE-13185 new APIs and changes were introduced to view Cluster ID and Tag 
and change Tag.

These APIs and use cases need to be documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13212) Peer class loading does not work for Scan Query

2020-07-03 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13212:


 Summary: Peer class loading does not work for Scan Query
 Key: IGNITE-13212
 URL: https://issues.apache.org/jira/browse/IGNITE-13212
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.9


When a scan query with a transformer is executed via the {{IgniteCache::query}} 
API and the class passed as a transformer is not available on remote nodes, the 
p2p mechanism is not triggered and an exception is thrown on the server nodes 
executing the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13190) Core defragmentation functions

2020-06-26 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13190:


 Summary: Core defragmentation functions
 Key: IGNITE-13190
 URL: https://issues.apache.org/jira/browse/IGNITE-13190
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


The following set of functions covering the defragmentation happy case is needed:
 * Initialization of defragmentation manager when node is started in 
maintenance mode.
 * Information about partition files is gathered by defrag mgr.
 * For each partition file corresponding file of defragmented partition is 
created and initialized.
 * Keys are transferred from old partitions to new partitions.
 * Checkpointer is aware of new partition files and flushes defragmented memory 
to new partition files.

 

No fault-tolerance code nor index defragmentation mappings are needed in this 
task.
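
A highly simplified sketch of the happy-case flow listed above; PartitionStore 
is a stand-in interface, not an Ignite class (real code works with page memory 
and cache data rows):

    import java.util.List;
    import java.util.Map;

    interface PartitionStore {
        Iterable<Map.Entry<byte[], byte[]>> entries(); // full scan of the old partition file
        void insert(byte[] key, byte[] value);         // append into the new, compact file
        void flush();                                  // checkpoint-like flush of the new file
    }

    class DefragmentationSketch {
        /** Transfers every key from the old partition files into freshly created compact ones. */
        static void defragment(List<PartitionStore> oldParts, List<PartitionStore> newParts) {
            for (int p = 0; p < oldParts.size(); p++) {
                PartitionStore oldPart = oldParts.get(p);
                PartitionStore newPart = newParts.get(p);

                for (Map.Entry<byte[], byte[]> e : oldPart.entries())
                    newPart.insert(e.getKey(), e.getValue());

                // The checkpointer flushes the defragmented pages to the new partition file.
                newPart.flush();
            }
        }
    }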



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13189) Maintenance mode switch and defragmentation process initialization

2020-06-26 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13189:


 Summary: Maintenance mode switch and defragmentation process 
initialization
 Key: IGNITE-13189
 URL: https://issues.apache.org/jira/browse/IGNITE-13189
 Project: Ignite
  Issue Type: Sub-task
Reporter: Sergey Chugunov


As described in IEP-47 defragmentation is performed when a node enters a 
special mode called maintenance mode.

Discussion on the dev list clarified the algorithm for entering maintenance mode:
 # A special key is written to the local metastorage.
 # The node is restarted.
 # The node observes the key on startup and enters maintenance mode.

The node should be fully functioning in that mode but should not join the rest of 
the cluster or participate in any regular activity like handling cache 
operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13185) API to change Cluster Tag and notify about change of Cluster Tag

2020-06-25 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13185:


 Summary: API to change Cluster Tag and notify about change of 
Cluster Tag
 Key: IGNITE-13185
 URL: https://issues.apache.org/jira/browse/IGNITE-13185
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov


IGNITE-12111 introduced new feature to identify and distinguish different 
clusters.

To make the feature more usable we need new command in CLI interface to change 
Cluster Tag and new event to subscribe for changes of Cluster Tag.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13182) Document Cluster ID and Tag feature

2020-06-25 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13182:


 Summary: Document Cluster ID and Tag feature
 Key: IGNITE-13182
 URL: https://issues.apache.org/jira/browse/IGNITE-13182
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov


IGNITE-12111 introduced new feature to identify and give a name to the cluster: 
Cluster ID and Tag.

Feature in general and APIs to manage it in particular need to be documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSSION] New Ignite settings for IGNITE-12438 and IGNITE-13013

2020-06-16 Thread Sergey Chugunov
Val,

I like your suggestion about naming; it describes the purpose of the
configuration best.
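
For reference, a hedged example of what enabling the new mode could look like
with the naming proposed in this thread (the exact setter in the released API
may differ):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class ClientToServerOnlySketch {
        static IgniteConfiguration clientConfig() {
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            commSpi.setForceClientToServerConnections(true); // assumed setter, false by default

            // A client started with this configuration never accepts server-initiated
            // connections; servers ask it (via an inverse connection request) to connect back.
            return new IgniteConfiguration()
                .setClientMode(true)
                .setCommunicationSpi(commSpi);
        }
    }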

On Tue, Jun 16, 2020 at 5:18 PM Ivan Bessonov  wrote:

> Hi,
>
> I created new issue that describes CQ problem in more details: [1]
> I'm fine with experimental status and new property naming.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-13156
>
> вт, 16 июн. 2020 г. в 02:20, Valentin Kulichenko <
> valentin.kuliche...@gmail.com>:
>
> > Folks,
> >
> > Thanks for providing the detailed clarifications. Let's add the
> parameter,
> > mark the new feature as experimental, and target for making it the
> default
> > mode in Ignite 3.0.
> >
> > I still don't think we can come up with a naming that is really
> intuitive,
> > but let's try to simplify it as much as possible. How about this:
> >
> > TcpCommunicationSpi#forceClientToServerConnections -- false by default,
> > true if the new mode needs to be enabled.
> >
> > Let me know your thoughts.
> >
> > -Val
> >
> > On Wed, Jun 10, 2020 at 4:10 PM Denis Magda  wrote:
> >
> > > Sergey,
> > >
> > > Thanks for the detailed explanation and for covering all corner cases.
> > >
> > > Considering the improvement's criticality, I would continue moving in
> the
> > > initial direction and add that particular configuration property.
> > > Potentially, we can put more effort throughout an Ignite 3.0 timeframe
> > and
> > > remove the property altogether. @Valentin Kulichenko
> > > , could you please suggest any alternate
> > naming?
> > >
> > > Btw, what are the specifics of the issue with continuous queries? It
> will
> > > be ideal if we could release this new communication option in the GA
> > state
> > > in 2.9.
> > >
> > > -
> > > Denis
> > >
> > >
> > > On Wed, Jun 10, 2020 at 1:22 AM Sergey Chugunov <
> > sergey.chugu...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Denis, Val,
> > > >
> > > > Idea of prohibiting servers to open connections to clients and force
> > > > clients to always open "inverse connections" to servers looks
> > promising.
> > > To
> > > > be clear, by "inverse connections" I mean here that server needed to
> > > > communicate with client requests client to open a connection back
> > instead
> > > > of opening connection by itself using addresses published by the
> > client.
> > > >
> > > > If we apply the idea it will indeed allow us to simplify our
> > > configuration
> > > > (no need for new configuration property), another advantage is
> clients
> > > > won't need to publish their addresses anymore (with one side note
> I'll
> > > > cover at the end), it will also simplify our code.
> > > >
> > > > However applying it with current implementation of inverse connection
> > > > request (when request goes across all ring) may bring significant
> delay
> > > of
> > > > opening first connection depending on cluster size and relative
> > positions
> > > > between server that needs to communicate with client (target server)
> > and
> > > > client's router node.
> > > >
> > > > It is possible to overcome this by sending inverse connection request
> > not
> > > > via discovery but directly to router server node via communication
> and
> > > > convert to discovery message only on the router. We'll still have two
> > > hops
> > > > of communication request instead of one plus discovery worker on
> client
> > > may
> > > > be busy working on other stuff slowing down handling of connection
> > > request.
> > > > But it should be fine.
> > > >
> > > > However with this solution it is hard to implement failover of router
> > > node:
> > > > let me describe it in more details.
> > > > In case of router node failure target server won't be able to
> determine
> > > if
> > > > client received inverse comm request successfully and (even worse)
> > won't
> > > be
> > > > able to figure out new router for the client without waiting for
> > > discovery
> > > > event of the client reconnect.
> > > > And this brings us to the following choise: we either wait for
> > discovery
> > > > event about 

[jira] [Created] (IGNITE-13151) Checkpointer code refactoring

2020-06-15 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13151:


 Summary: Checkpointer code refactoring
 Key: IGNITE-13151
 URL: https://issues.apache.org/jira/browse/IGNITE-13151
 Project: Ignite
  Issue Type: Sub-task
  Components: persistence
Reporter: Sergey Chugunov


The checkpointer is at the center of the Ignite persistence subsystem; the more 
people from the community understand it, the more stable and efficient it 
becomes.

However, for now the checkpointer code sits inside the 
GridCacheDatabaseSharedManager class and is entangled with this higher-level 
and more general component.

To take a step toward a more modular checkpointer we need to do two things:
 # Move the checkpointer code outside the database manager into a separate class.
 # Create a well-defined checkpointer API that will allow us to create new 
checkpointer implementations in the future. An example of this is the new 
checkpointer implementation needed for the defragmentation feature.
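
As an illustration only, a minimal sketch of what such an API could look like 
(names are hypothetical, not the final design):

    import java.util.concurrent.CompletableFuture;

    /** Hypothetical minimal checkpointer API extracted from the database manager. */
    interface Checkpointer {
        /** Schedules a checkpoint; the future completes when all dirty pages are on disk. */
        CompletableFuture<Void> scheduleCheckpoint(String reason);

        /** Guards page modifications against a concurrently running checkpoint. */
        void checkpointReadLock();

        void checkpointReadUnlock();
    }

    // The defragmentation feature could then provide its own implementation that
    // writes pages into the new (defragmented) partition files instead of the original ones.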



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13143) Persistent store defragmentation

2020-06-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13143:


 Summary: Persistent store defragmentation
 Key: IGNITE-13143
 URL: https://issues.apache.org/jira/browse/IGNITE-13143
 Project: Ignite
  Issue Type: New Feature
Reporter: Sergey Chugunov


Persistent store enables users to store data of their caches in a durable 
fashion on disk still benefiting from in-memory nature of Apache Ignite. Data 
of caches is stored in files created for every primary or backup partition 
assigned to that node and in an additional file for all user indexes.

Files in filesystem are allocated lazily (only if some data is actually stored 
to particular partition) and grow automatically when more data is added to the 
cache. But the problem is that files cannot shrink even if all data is removed.

This umbrella ticket covers all other tasks needed to implement simple yet 
effective approach to defragmentation. Detailed discussion could be found in 
[IEP-47|https://cwiki.apache.org/confluence/display/IGNITE/IEP-47%3A+Native+persistence+defragmentation]
 and in corresponding [dev-list 
discussion|http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-47-Native-persistence-defragmentation-td47717.html]
 but core ideas are as follows:
 # Defragmentation is performed in a special _maintenance_ mode when node 
starts, provides access to some APIs like metrics or JMX management but doesn't 
join the cluster.
 # It is performed by copying all data from all partitions on node to new files 
with automatic compaction. After successful copy old partition files are 
deleted.
 # Metrics on progress of the operation are provided to the user.
 # Operation is fault-tolerant and in case of node failure proceeds after node 
restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13141) Modify .NET counterpart of IgniteCluster to include functionality of Cluster ID and tag

2020-06-10 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-13141:


 Summary: Modify .NET counterpart of IgniteCluster to include 
functionality of Cluster ID and tag
 Key: IGNITE-13141
 URL: https://issues.apache.org/jira/browse/IGNITE-13141
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov


After the implementation of IGNITE-12111, .NET tests showed broken API parity 
in the new methods of the IgniteCluster interface.

We need to implement the same functionality on .NET side (see description in 
linked ticket).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSSION] New Ignite settings for IGNITE-12438 and IGNITE-13013

2020-06-10 Thread Sergey Chugunov
Denis, Val,

The idea of prohibiting servers from opening connections to clients and
forcing clients to always open "inverse connections" to servers looks
promising. To be clear, by "inverse connections" I mean here that a server
that needs to communicate with a client requests the client to open a
connection back, instead of opening the connection itself using the addresses
published by the client.

If we apply the idea it will indeed allow us to simplify our configuration
(no need for a new configuration property); another advantage is that clients
won't need to publish their addresses anymore (with one side note I'll cover
at the end), and it will also simplify our code.

However, applying it with the current implementation of the inverse connection
request (where the request travels across the whole ring) may bring a
significant delay in opening the first connection, depending on cluster size
and the relative positions of the server that needs to communicate with the
client (the target server) and the client's router node.

It is possible to overcome this by sending the inverse connection request not
via discovery but directly to the router server node via communication, and
converting it to a discovery message only on the router. We'll still have two
hops of the communication request instead of one, plus the discovery worker on
the client may be busy with other work, slowing down handling of the
connection request. But it should be fine.

However, with this solution it is hard to implement failover of the router
node; let me describe it in more detail.
In case of a router node failure the target server won't be able to determine
whether the client received the inverse communication request successfully
and (even worse) won't be able to figure out the new router for the client
without waiting for the discovery event of the client reconnect.
And this brings us to the following choice: we either wait for the discovery
event about the client reconnect (this is deadlock-prone, as the current
protocol of CQ registration opens a communication connection to the client
right from the discovery thread in some cases; if we wait for a new discovery
event from the discovery thread, it is a deadlock), or we fail opening the
connection by timeout, thus adding new scenarios where opening a connection
may fail.

Thus implementing communication model "clients connect to servers, servers
never connect to clients" efficiently requires more work on different parts
of our functionality and rigorous testing of readiness of our code for more
communication connection failures.

So to me the least risky decision is not to delete the new configuration but
to leave it with experimental status. If we find out that the direct request
(server -> router server -> target client) implementation works well and
doesn't bring much complexity in failover scenarios, we'll remove that
configuration and prohibit servers from opening connections to clients by
default.

Side note: there are rare but still possible scenarios where a client node
needs to open a communication connection to another client node. If we let
clients not publish their addresses, these scenarios will stop working without
additional logic like sending data through the router node. As far as I know
client-client connectivity is involved in p2p class deployment scenarios; does
anyone know about other cases?

--
Thanks,
Sergey Chugunov

On Wed, Jun 3, 2020 at 5:37 PM Denis Magda  wrote:

> Ivan,
>
> It feels like Val is driving us in the right direction. Is there any reason
> for keeping the current logic when servers can open connections to clients?
>
> -
> Denis
>
>
> On Thu, May 21, 2020 at 4:48 PM Valentin Kulichenko <
> valentin.kuliche...@gmail.com> wrote:
>
> > Ivan,
> >
> > Have you considered eliminating server to client connections altogether?
> > Or, at the very least making the "client to server only" mode the default
> > one?
> >
> > All the suggested names are confusing and not intuitive, and I doubt we
> > will be able to find a good one. A server initiating a TCP connection
> with
> > a client is confusing in the first place and creates a usability issue.
> We
> > now want to solve it by introducing an additional configuration
> > parameter, and therefore additional complexity. I don't think this is the
> > right approach.
> >
> > What are the drawbacks of permanently switching to client-to-server
> > connections? Is there any value provided by the server-to-client option?
> >
> > As for pair connections, I'm not sure I understand why there is a
> > limitation. As far as I know, the idea behind this feature is that we
> > maintain two connections between two nodes instead of one, so that every
> > connection is used for communication in a single direction only. Why does
> > it matter which node initiates the connection? Why can't one of the nodes
> > (e.g., a client) initiate both connections, and then use them
> accordingly?
> > Correct me 

Re: Question: network issues of single node.

2020-06-08 Thread Sergey Chugunov
Of course I meant that ticket [1] increased cluster stability in situations of
a blinking network.

[1] https://issues.apache.org/jira/browse/IGNITE-7163

On Mon, Jun 8, 2020 at 1:51 PM Sergey Chugunov 
wrote:

> Vladimir,
>
> Adding to what Alexey has said I remember that cases of short-term network
> issues (blinking network) were also a driver for this improvement. They are
> indeed hard to reproduce but have been seen in real world set-ups and have
> proven to increase cluster stability.
>
> On Sat, Jun 6, 2020 at 5:09 PM Denis Magda  wrote:
>
>> Finally, I got your question.
>>
>> Back in 2017-2018, there was a Discovery SPI's stabilization activity. The
>> networking component could fail in various hard-to-reproduce scenarios
>> affecting cluster availability and consistency. That ticket reminds me of
>> those notorious issues that would fire once a week or month under specific
>> configuration settings. So, I would not touch the code that fixes the
>> issue
>> unless @Alexey Goncharuk  or @Sergey Chugunov
>>  confirms that it's safe to do. Also, there
>> should
>> be a test for this scenario.
>>
>> -
>> Denis
>>
>>
>> On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin 
>> wrote:
>>
>> > Denis,
>> >
>> > I have no nodes that I'm unable to interconnect. This case is simulated
>> > in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill()
>> > Introduced in [1].
>> >
>> > I’m asking if it is real or supposed problem. Where it was met? Which
>> > network configuration/issues could be?
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-7163
>> >
>> > 05.06.2020 1:01, Denis Magda пишет:
>> > > Vladimir,
>> > >
>> > > I'm suggesting to share the log files from the nodes that are unable
>> to
>> > > interconnect so that the community can check them for potential
>> issues.
>> > > Instead of sharing the logs from all the 5 nodes, try to start a
>> > two-nodes
>> > > cluster with the nodes that fail to discover each other and attach the
>> > logs
>> > > from those.
>> > >
>> > > -
>> > > Denis
>> > >
>> > >
>> > > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin 
>> > wrote:
>> > >
>> > >> Denis, hi.
>> > >>
>> > >>   Sorry, I didn’t catch your idea. Are you saying this can happen
>> > and
>> > >> suggest experiment? I’m not descripting a probable case. It is
>> already
>> > >> done in [1]. I’m asking is it real, where it was met.
>> > >>
>> > >>
>> > >> 04.06.2020 23:33, Denis Magda пишет:
>> > >>> Vladimir,
>> > >>>
>> > >>> Please do the following experiment. Start a 2-nodes cluster booting
>> > node
>> > >> 3
>> > >>> and, for instance, node 5. Those won't be able to interconnect
>> > according
>> > >> to
>> > >>> your description. Attach the log files from both nodes for analysis.
>> > This
>> > >>> should be a networking issue.
>> > >>>
>> > >>> -
>> > >>> Denis
>> > >>>
>> > >>>
>> > >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin > >
>> > >> wrote:
>> > >>>>Hi, Igniters.
>> > >>>>
>> > >>>>
>> > >>>>I wanted to ask how one node may not be able to connect to
>> > another
>> > >>>> whereas rest of the cluster can. This got covered in [1]. In short:
>> > node
>> > >>>> 3 can't connect to nodes 4 and 5 but can to 1. At the same time,
>> node
>> > 2
>> > >>>> can connect to 4. Questions:
>> > >>>>
>> > >>>> 1) Is it real case? Where this problem came from?
>> > >>>>
>> > >>>> 2) If node 3 can’t connect to 4 and 5, does it mean node 2 can’t
>> > connect
>> > >>>> to 4 (and 5) too?
>> > >>>>
>> > >>>> Sergey, Dmitry maybe you bring light (I see you in [1])? I'm
>> > >>>> participating in [2] and found this backward connection checking.
>> > >>>> Answering would help us a lot.
>> > >>>>
>> > >>>> Thanks!
>> > >>>>
>> > >>>> [1]
>> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163<
>> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163>
>> > >>>>
>> > >>>> [2]
>> > >>>>
>> > >>>>
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up
>> > >>>> <
>> > >>>>
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up
>> >
>>
>


Re: Question: network issues of single node.

2020-06-08 Thread Sergey Chugunov
Vladimir,

Adding to what Alexey has said I remember that cases of short-term network
issues (blinking network) were also a driver for this improvement. They are
indeed hard to reproduce but have been seen in real world set-ups and have
proven to increase cluster stability.

On Sat, Jun 6, 2020 at 5:09 PM Denis Magda  wrote:

> Finally, I got your question.
>
> Back in 2017-2018, there was a Discovery SPI's stabilization activity. The
> networking component could fail in various hard-to-reproduce scenarios
> affecting cluster availability and consistency. That ticket reminds me of
> those notorious issues that would fire once a week or month under specific
> configuration settings. So, I would not touch the code that fixes the issue
> unless @Alexey Goncharuk  or @Sergey Chugunov
>  confirms that it's safe to do. Also, there should
> be a test for this scenario.
>
> -
> Denis
>
>
> On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin 
> wrote:
>
> > Denis,
> >
> > I have no nodes that I'm unable to interconnect. This case is simulated
> > in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill()
> > Introduced in [1].
> >
> > I’m asking if it is real or supposed problem. Where it was met? Which
> > network configuration/issues could be?
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-7163
> >
> > 05.06.2020 1:01, Denis Magda пишет:
> > > Vladimir,
> > >
> > > I'm suggesting to share the log files from the nodes that are unable to
> > > interconnect so that the community can check them for potential issues.
> > > Instead of sharing the logs from all the 5 nodes, try to start a
> > two-nodes
> > > cluster with the nodes that fail to discover each other and attach the
> > logs
> > > from those.
> > >
> > > -
> > > Denis
> > >
> > >
> > > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin 
> > wrote:
> > >
> > >> Denis, hi.
> > >>
> > >>   Sorry, I didn’t catch your idea. Are you saying this can happen
> > and
> > >> suggest experiment? I’m not descripting a probable case. It is already
> > >> done in [1]. I’m asking is it real, where it was met.
> > >>
> > >>
> > >> 04.06.2020 23:33, Denis Magda пишет:
> > >>> Vladimir,
> > >>>
> > >>> Please do the following experiment. Start a 2-nodes cluster booting
> > node
> > >> 3
> > >>> and, for instance, node 5. Those won't be able to interconnect
> > according
> > >> to
> > >>> your description. Attach the log files from both nodes for analysis.
> > This
> > >>> should be a networking issue.
> > >>>
> > >>> -
> > >>> Denis
> > >>>
> > >>>
> > >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin 
> > >> wrote:
> > >>>>Hi, Igniters.
> > >>>>
> > >>>>
> > >>>>I wanted to ask how one node may not be able to connect to
> > another
> > >>>> whereas rest of the cluster can. This got covered in [1]. In short:
> > node
> > >>>> 3 can't connect to nodes 4 and 5 but can to 1. At the same time,
> node
> > 2
> > >>>> can connect to 4. Questions:
> > >>>>
> > >>>> 1) Is it real case? Where this problem came from?
> > >>>>
> > >>>> 2) If node 3 can’t connect to 4 and 5, does it mean node 2 can’t
> > connect
> > >>>> to 4 (and 5) too?
> > >>>>
> > >>>> Sergey, Dmitry maybe you bring light (I see you in [1])? I'm
> > >>>> participating in [2] and found this backward connection checking.
> > >>>> Answering would help us a lot.
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> [1]
> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163<
> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163>
> > >>>>
> > >>>> [2]
> > >>>>
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up
> > >>>> <
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up
> >
>


Re: [DISCUSSION] IEP-47 Native persistence defragmentation

2020-06-01 Thread Sergey Chugunov
Hi Ivan,

I have an idea about the suggested maintenance mode.

First of all, I agree with your ideas about discovery restrictions: a node
should not join the topology while performing defragmentation.

At the same time I haven't heard about requests for this mode from users,
so we don't know much about possible requirements.
So I suggest implementing it in a pragmatic way: instead of inventing user
scenarios (unknown in reality), let's develop minimal but well-designed
functionality that suits our case. If we constrain our implementation with a
reasonable set of restrictions, that's OK.

So my idea is the following: to transition a node to maintenance, the user
has to send a special command to the node (e.g. with a new command in the
control.sh utility or via the JMX interface). The node saves the maintenance
request in the local metastorage and waits for a restart. The user has to
manually restart that node in order to finish moving it to maintenance mode.

When the node restarts and finds the maintenance request, it creates a
special type of discovery SPI that will not try to join the topology at all,
yet the node is able to start all necessary components and APIs like the
REST processor or the JMX interface.
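
A rough sketch of this flow; the map below is only a stand-in for the local
metastorage, and the key name is illustrative:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class MaintenanceFlowSketch {
        private static final String MAINTENANCE_KEY = "maintenance.defragmentation";

        /** Stand-in for the local metastorage (in reality a persistent key-value store). */
        static final Map<String, String> localMetaStorage = new ConcurrentHashMap<>();

        /** Step 1: a user command (control.sh / JMX) records the maintenance request. */
        static void requestMaintenance(String cacheNames) {
            localMetaStorage.put(MAINTENANCE_KEY, cacheNames);
            // Step 2: the user restarts the node manually.
        }

        /** Step 3: on startup the node checks the key and picks the isolated discovery SPI. */
        static boolean shouldStartInMaintenanceMode() {
            return localMetaStorage.containsKey(MAINTENANCE_KEY);
        }

        /** Once defragmentation completes, the request is removed and the next restart is normal. */
        static void completeMaintenance() {
            localMetaStorage.remove(MAINTENANCE_KEY);
        }
    }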

When in maintenance, we'll be able to do defragmentation safely and remove the
maintenance request from the metastorage only when it is completed (with all
the fault-tolerance logic in mind).

As we don't have a mechanism (like a watcher) to perform a "safe restart" (by
safe I mean an Ignite restart without an OS-level process restart), we cannot
finish maintenance mode without another manual restart, but I think it is a
reasonable restriction as maintenance mode shouldn't be an every-day routine
and will be used quite rarely.

What do you think about it?

On Tue, May 26, 2020 at 5:58 PM Ivan Bessonov  wrote:

> Hello Igniters,
>
> I'd like to discuss this new IEP with you: [1]. The main idea is to have a
> procedure that relocates
> pages to the top of the file as compact as possible which allows us to
> trim the file and increase its
> fill-factor. It will be configured manually and executed after the restart,
> but before node joins
> topology (it means any load would be impossible during defragmentation). It
> is described in detail
> in the IEP itself, please read it. This topic was also previously discussed
> here on dev-list in [2].
>
> Here I would like to list a few moments that are not as clear and require
> your opinion.
>
>  - what are your overall thoughts on the IEP? Any concerns?
>
>  - maintenance mode - how do we communicate with the node that's not in
> topology? What are
>the options? As far as I know, we have no current tools like this.
>
>  - checkpointer refactoring - these changes will involve intensive writing
> of pages to the storage.
>If we're going to reuse the offheap page model to perform
> defragmentation then the
>checkpointing mechanism will have to be adapted in some form.
>Are you fine with this? Or we need a separate discussion?
>
> [1]
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47%3A+Native+persistence+defragmentation
> [2]
>
> http://apache-ignite-developers.2346864.n4.nabble.com/How-to-free-up-space-on-disc-after-removing-entries-from-IgniteCache-with-enabled-PDS-td39839.html
>
>
> --
> Sincerely yours,
> Ivan Bessonov
>


[jira] [Created] (IGNITE-12878) Improve logging for async writing of Binary Metadata

2020-04-08 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-12878:


 Summary: Improve logging for async writing of Binary Metadata
 Key: IGNITE-12878
 URL: https://issues.apache.org/jira/browse/IGNITE-12878
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8.1


A new implementation of writing binary metadata outside of the discovery thread 
was introduced in IGNITE-12099, but sufficient debug logging was missing.

To provide enough information for debugging we need to add the necessary 
logging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-12876) Test to cover deadlock fix between checkpoint, entry update and ttl-cleanup threads

2020-04-08 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-12876:


 Summary: Test to cover deadlock fix between checkpoint, entry 
update and ttl-cleanup threads
 Key: IGNITE-12876
 URL: https://issues.apache.org/jira/browse/IGNITE-12876
 Project: Ignite
  Issue Type: Test
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8.1


The IGNITE-12594 ticket fixed a deadlock between several threads that was 
reproducible with low probability in unrelated tests.

To improve test coverage of the fix, a new test dedicated to the deadlock 
situation is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: deadlock in system pool on meta update

2020-03-17 Thread Sergey Chugunov
Hello Sergey,

Your analysis looks valid to me, we definitely need to investigate this
deadlock and find out how to fix it.

Could you create a ticket and write a test that reproduces the issue with
sufficient probability?
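
As a starting point, here is a minimal sketch of such a reproducer setup (the
instance name, payload class and pool size are illustrative, taken from the
report below; a multi-node topology with classes unknown to the caller is
still needed to actually hit the deadlock):

    import java.io.Serializable;
    import java.util.Collection;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.lang.IgniteCallable;

    public class SystemPoolStarvationSketch {
        /** A value type the caller hasn't registered yet, forcing a binary metadata lookup. */
        static class Payload implements Serializable {
            final int value;
            Payload(int value) { this.value = value; }
        }

        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                .setIgniteInstanceName("caller")
                .setSystemThreadPoolSize(3); // the small pool size from the thread dump below

            try (Ignite ignite = Ignition.start(cfg)) {
                // Each job response is processed in the system pool; with enough concurrent
                // responses blocked on metadata futures the pool starves, as described below.
                Collection<Payload> results =
                    ignite.compute().broadcast((IgniteCallable<Payload>)() -> new Payload(42));

                System.out.println("Got " + results.size() + " results");
            }
        }
    }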

Thanks!

On Mon, Mar 16, 2020 at 8:22 PM Sergey-A Kosarev 
wrote:

> Classification: Public
>
> Hi,
> I've recently tried to apply Ilya's idea (
> https://issues.apache.org/jira/browse/IGNITE-12663) of minimizing thread
> pools and tried to set system pool to 3 in my own tests.
> It caused deadlock on a client node and I think it can happen not only on
> such small pool values.
>
> Details are following:
> I'm not using persistence currently (if it matters).
> On the client note I use ignite compute to  call   a job on every server
> node (there are 3 server nodes in the tests).
>
> Then I've found in logs:
>
> [10:55:21] : [Step 1/1] [2020-03-13 10:55:21,773] {
> grid-timeout-worker-#8} [WARN] [o.a.i.i.IgniteKernal] - Possible thread
> pool starvation detected (no task completed in last 3ms, is system
> thread pool size large enough?)
> [10:55:21] : [Step 1/1] ^-- System thread pool [active=3, idle=0,
> qSize=9]
>
>
> I see in threaddumps that all 3 system pool workers do the same -
> processing of job responses:
>
> "sys-#26" #605 daemon prio=5 os_prio=0 tid=0x64a0a800 nid=0x1f34
> waiting on condition [0x7b91d000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
> at
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
> at
> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
> at
> org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.metadata(CacheObjectBinaryProcessorImpl.java:749)
> at
> org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$1.metadata(CacheObjectBinaryProcessorImpl.java:250)
> at
> org.apache.ignite.internal.binary.BinaryContext.metadata(BinaryContext.java:1169)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.getOrCreateSchema(BinaryReaderExImpl.java:2005)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.(BinaryReaderExImpl.java:285)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.(BinaryReaderExImpl.java:184)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
> at
> org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.readField(BinaryReaderExImpl.java:1982)
> at
> org.apache.ignite.internal.binary.BinaryFieldAccessor$DefaultFinalClassAccessor.read0(BinaryFieldAccessor.java:702)
> at
> org.apache.ignite.internal.binary.BinaryFieldAccessor.read(BinaryFieldAccessor.java:187)
> at
> org.apache.ignite.internal.binary.BinaryClassDescriptor.read(BinaryClassDescriptor.java:887)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1762)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
> at
> org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> at
> org.apache.ignite.internal.binary.GridBinaryMarshaller.deserialize(GridBinaryMarshaller.java:306)
> at
> org.apache.ignite.internal.binary.BinaryMarshaller.unmarshal0(BinaryMarshaller.java:100)
> at
> org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:80)
> at
> org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10493)
> at
> 

Re: MetaStorage key length limitations and Cache Metrics configuration

2020-02-28 Thread Sergey Chugunov
Ivan,

I also don't think this issue is a blocker for 2.8 as it affects only
experimental functionality and only in special cases.

Removing key length limitations in MetaStorage seems a more strategic
approach to me, but depending on how we decide to approach it (as a local
fix or as part of a broader improvement of the MetaStorage internal
implementation) we may target it to 2.8.1 or 2.9.

In the latter case it makes sense to implement key length validation [1]
and include it in 2.8.1 to prevent users from making destructive actions.
Otherwise, if we decide to implement [2] earlier and remove this pesky
limitation in 2.8.1, then I'm fine with closing [1] with a "Won't fix"
resolution.

Does it make sense to you?

[1] https://issues.apache.org/jira/browse/IGNITE-12721
[2] https://issues.apache.org/jira/browse/IGNITE-12726

On Fri, Feb 28, 2020 at 4:18 PM Maxim Muzafarov  wrote:

> Ivan,
>
>
> This issue doesn't seem to be a blocker for 2.8 release from my point of
> view.
> I think we definitely will have such bugs in future and 2.8.1 is our
> goal for them.
>
> Please, let me know if we should wait for the fix and include it exactly
> in 2.8.
>
> On Fri, 28 Feb 2020 at 15:40, Nikolay Izhikov  wrote:
> >
> > Igniters,
> >
> > I think we can replace cache name with the cache id.
> > This should solve issue with the length limitation.
> >
> > What do you think?
> >
> > > 28 февр. 2020 г., в 15:32, Ivan Bessonov 
> написал(а):
> > >
> > > Hello Igniters,
> > >
> > > we have an issue in master branch and in the upcoming 2.8 release that
> > > related to new metrics functionality implemented in [1]. You can't use
> new
> > > "configureHistogramMetric" and "configureHitRateMetric" configuration
> > > methods on caches with long names. My estimation shows that cache with
> 30
> > > characters in its name will shut down your whole cluster with failure
> > > handler if
> > > you try to change metrics configuration for it using one of those
> methods.
> > >
> > > Initially we wanted to merge [2] to show a valid error message instead
> of
> > > failing
> > > the cluster, but it wasn't in plans for 2.8 because we didn't know
> that it
> > > clashes
> > > with [1].
> > >
> > > I created issue [3] with plans of removing MetaStorage key length
> > > limitations, but
> > > it requires some thoughtful MetaStorageTree reworkings. I mean that it
> > > can't be
> > > done in only a few days.
> > >
> > > What do you think? Does this issue affect 2.8 release? AFAIK new
> metrics are
> > > experimental and they can have some known issues. Feel free to ask me
> for
> > > more
> > > details if it's needed.
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11987
> > > [2] https://issues.apache.org/jira/browse/IGNITE-12721
> > > [3] https://issues.apache.org/jira/browse/IGNITE-12726
> > >
> > > --
> > > Sincerely yours,
> > > Ivan Bessonov
> >
>


[jira] [Created] (IGNITE-12721) Validation of key length written to Distributed Metastorage

2020-02-27 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-12721:


 Summary: Validation of key length written to Distributed 
Metastorage
 Key: IGNITE-12721
 URL: https://issues.apache.org/jira/browse/IGNITE-12721
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.9


DistributedMetastorage functionality introduced in IGNITE-10640 provides a 
convenient way to perform coordinated writes to local MetaStorages on all 
server nodes but lacks an important part: validation of key length.

The current implementation of MetaStorage doesn't allow keys longer than a specific 
value (64 bytes minus some prefixes, see source code for details) and throws an 
assertion error on an attempt to write a longer key.

This error from MetaStorage is not propagated to DistributedMetastorage and (in 
theory) may even cause a node to halt.

In order to avoid this situation, validation of key length should be added right 
in the DistributedMetastorage implementation to enforce the "fail-fast" principle and 
protect Ignite nodes from potentially dangerous consequences.
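
For illustration, the fail-fast check could look roughly like this (the constant and the method below are illustrative, not the actual implementation):

{code}
/** Illustrative limit; the real value is 64 bytes minus internal prefixes. */
private static final int MAX_KEY_BYTES = 64;

public void write(String key, Serializable val) throws IgniteCheckedException {
    int keyLen = key.getBytes(StandardCharsets.UTF_8).length;

    if (keyLen > MAX_KEY_BYTES)
        throw new IgniteCheckedException("Distributed metastorage key is too long " +
            "[key=" + key + ", length=" + keyLen + ", maxLength=" + MAX_KEY_BYTES + ']');

    // Proceed with the distributed write only after the key has passed validation.
}
{code}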



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Allow or prohibit a joint use of @deprecated and @IgniteExperimental

2020-02-12 Thread Sergey Chugunov
-1 Prohibit.

To me as a developer, the situation when an old but stable API is deprecated
with only an experimental (thus unstable/unfinished) alternative is very far
from comfortable.
And to outside folks it may look like a sign of immature processes
inside the Ignite community (which is definitely not the case) and hurt
overall users' impression.

On Tue, Feb 11, 2020 at 2:20 PM Andrey Gura  wrote:

> -1 Prohibit
>
> On Mon, Feb 10, 2020 at 11:02 AM Alexey Goncharuk 
> wrote:
> >
> > Dear Apache Ignite community,
> >
> > We would like to conduct a formal vote on the subject of whether to allow
> > or prohibit a joint existence of @deprecated annotation for an old API
> > and @IgniteExperimental [1] for a new (replacement) API. The result of
> this
> > vote will be formalized as an Apache Ignite development rule to be used
> in
> > future.
> >
> > The discussion thread where you can address all non-vote messages is [2].
> >
> > The votes are:
> > *[+1 Allow]* Allow to deprecate the old APIs even when new APIs are
> marked
> > with @IgniteExperimental to explicitly notify users that an old APIs will
> > be removed in the next major release AND new APIs are available.
> > *[-1 Prohibit]* Never deprecate the old APIs unless the new APIs are
> stable
> > and released without @IgniteExperimental. The old APIs javadoc may be
> > updated with a reference to new APIs to encourage users to evaluate new
> > APIs. The deprecation and new API release may happen simultaneously if
> the
> > new API is not marked with @IgniteExperimental or the annotation is
> removed
> > in the same release.
> >
> > Neither of the choices prohibits deprecation of an API without a
> > replacement if community decides so.
> >
> > The vote will hold for 72 hours and will end on February 13th 2020 08:00
> > UTC:
> >
> https://www.timeanddate.com/countdown/to?year=2020=2=13=8=0=0=utc-1
> >
> > All votes count, there is no binding/non-binding status for this.
> >
> > [1]
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/lang/IgniteExperimental.java
> > [2]
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSS-Public-API-deprecation-rules-td45647.html
> >
> > Thanks,
> > --AG
>


[jira] [Created] (IGNITE-12646) When DEBUG mode is enabled GridToStringBuilder may throw java.util.ConcurrentModificationException

2020-02-09 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-12646:


 Summary: When DEBUG mode is enabled GridToStringBuilder may throw 
java.util.ConcurrentModificationException
 Key: IGNITE-12646
 URL: https://issues.apache.org/jira/browse/IGNITE-12646
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.9


With DEBUG enabled many components like CommunicationSPI start to log much 
larger chunks of information, e.g. communication messages are logged as is.

When a big enough message with a non-thread-safe collection inside is logged by 
the communication thread, it is possible that some other thread has already started 
processing the same message. If that processing involves modifying the collection, 
the communication thread will get a ConcurrentModificationException in the 
middle of iterating over it.

GridToStringBuilder should be safe from throwing this exception and 
(optionally) any type of RuntimeException.
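
For illustration, a guard of this kind could be applied when printing a collection field (a made-up helper, not the actual GridToStringBuilder code):

{code}
static String toStringSafe(Collection<?> col) {
    try {
        StringBuilder sb = new StringBuilder("[");

        for (Object o : col) {
            if (sb.length() > 1)
                sb.append(", ");

            sb.append(o);
        }

        return sb.append(']').toString();
    }
    catch (ConcurrentModificationException ignored) {
        // The collection is being modified by another thread; don't fail the logging call.
        return "[size=" + col.size() + ", concurrently modified]";
    }
}
{code}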



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: AWS EBS Discovery: Contributor Wanted

2020-01-28 Thread Sergey Chugunov
Denis, Emmanouil,

Sure, I'll take a look at the code shortly.

--
Thank you,
Sergey.

On Mon, Jan 27, 2020 at 8:59 PM Denis Magda  wrote:

> I support the idea of triggering such tests on demand. We can create a wiki
> page with instructions on how to run the tests. Unless there is a more
> elegant solution.
>
> Sergey, would you be able to review Emmanouil's changes in the IP Finder
> source code?
> https://issues.apache.org/jira/browse/IGNITE-8617
>
> -
> Denis
>
>
> On Sun, Jan 26, 2020 at 2:22 AM Emmanouil Gkatziouras <
> gkatzio...@gmail.com>
> wrote:
>
> > Hi all!
> >
> > I do believe being able to execute some AWS integration tests on demand
> > would be of value, especially for reviewers who cannot have an AWS stack
> > created on demand.
> > More than happy to help on that.
> >
> > Kind regards
> > *Emmanouil Gkatziouras*
> > https://egkatzioura.com/ |
> > https://www.linkedin.com/in/gkatziourasemmanouil/
> > https://github.com/gkatzioura
> >
> >
> > On Fri, 24 Jan 2020 at 15:15, Sergey Chugunov  >
> > wrote:
> >
> > > Hello Emmanouil,
> > >
> > > It would be great if we have at least basic integration tests in real
> AWS
> > > environment. Even though they may require more work to keep them green
> (I
> > > mean here AWS quotas and additional configuration/reconfiguration
> > efforts)
> > > it worth it because these tests can also be useful as an examples.
> > >
> > > As the same time as IpFinder is such a basic component I don't think we
> > > need to include them in constantly triggered suites like Run All but to
> > > trigger them manually before/right after merging them to master branch
> or
> > > when developing something in related code.
> > >
> > > What do you think?
> > >
> > > --
> > > Thank you,
> > > Sergey Chugunov.
> > >
> > > On Thu, Jan 23, 2020 at 5:32 PM Emmanouil Gkatziouras <
> > > gkatzio...@gmail.com>
> > > wrote:
> > >
> > > > Hi all!
> > > >
> > > > Yes It seems possible to get some free quota for integration tests on
> > AWS
> > > > [1] however most probably they are not gonna last forever.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
> > > >
> > > > King Regards
> > > > *Emmanouil Gkatziouras*
> > > > https://egkatzioura.com/ |
> > > > https://www.linkedin.com/in/gkatziourasemmanouil/
> > > > https://github.com/gkatzioura
> > > >
> > > >
> > > > On Wed, 22 Jan 2020 at 16:48, Denis Magda  wrote:
> > > >
> > > > > Hi Emmanouil,
> > > > >
> > > > > Thanks for preparing a pull-request for Application Load Balancer:
> > > > > https://issues.apache.org/jira/browse/IGNITE-8617
> > > > >
> > > > > Igniters, who is willing to step in as a primary reviewer?
> > > > >
> > > > > As for automated testing on AWS, are you aware of any sponsorship
> > > program
> > > > > of AWS for open source projects of our kind? It will be ideal to
> have
> > > > real
> > > > > testing on AWS but someone needs to pay.
> > > > >
> > > > > -
> > > > > Denis
> > > > >
> > > > >
> > > > > On Sun, Jan 19, 2020 at 6:45 AM Emmanouil Gkatziouras <
> > > > > gkatzio...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all!
> > > > > >
> > > > > > I have spinned up an Application Load Balancer and an autoscaling
> > > group
> > > > > on
> > > > > > AWS and the Ignite discovery using TcpDiscoveryAlbIpFinder works
> as
> > > > > > expected.
> > > > > >
> > > > > >- On startup nodes discover each other.
> > > > > >- On ec2 node down, connection is lost and the cluster
> > decreases.
> > > > > >- On an extra node addition the cluster size increases
> > > > > >
> > > > > > This contribution is essential since the Previous ELB based
> > discovery
> > > > > uses
> > > > > > the Classic Load Balancer which is still available h

Re: AWS EBS Discovery: Contributor Wanted

2020-01-24 Thread Sergey Chugunov
Hello Emmanouil,

It would be great if we had at least basic integration tests in a real AWS
environment. Even though they may require more work to keep them green (I
mean AWS quotas and additional configuration/reconfiguration efforts),
it is worth it because these tests can also be useful as examples.

At the same time, as IpFinder is such a basic component, I don't think we
need to include the tests in constantly triggered suites like Run All, but rather
trigger them manually before/right after merging changes to the master branch or
when developing something in related code.

What do you think?

--
Thank you,
Sergey Chugunov.

On Thu, Jan 23, 2020 at 5:32 PM Emmanouil Gkatziouras 
wrote:

> Hi all!
>
> Yes It seems possible to get some free quota for integration tests on AWS
> [1] however most probably they are not gonna last forever.
>
> [1]
>
> https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
>
> King Regards
> *Emmanouil Gkatziouras*
> https://egkatzioura.com/ |
> https://www.linkedin.com/in/gkatziourasemmanouil/
> https://github.com/gkatzioura
>
>
> On Wed, 22 Jan 2020 at 16:48, Denis Magda  wrote:
>
> > Hi Emmanouil,
> >
> > Thanks for preparing a pull-request for Application Load Balancer:
> > https://issues.apache.org/jira/browse/IGNITE-8617
> >
> > Igniters, who is willing to step in as a primary reviewer?
> >
> > As for automated testing on AWS, are you aware of any sponsorship program
> > of AWS for open source projects of our kind? It will be ideal to have
> real
> > testing on AWS but someone needs to pay.
> >
> > -
> > Denis
> >
> >
> > On Sun, Jan 19, 2020 at 6:45 AM Emmanouil Gkatziouras <
> > gkatzio...@gmail.com>
> > wrote:
> >
> > > Hi all!
> > >
> > > I have spinned up an Application Load Balancer and an autoscaling group
> > on
> > > AWS and the Ignite discovery using TcpDiscoveryAlbIpFinder works as
> > > expected.
> > >
> > >- On startup nodes discover each other.
> > >- On ec2 node down, connection is lost and the cluster decreases.
> > >- On an extra node addition the cluster size increases
> > >
> > > This contribution is essential since the Previous ELB based discovery
> > uses
> > > the Classic Load Balancer which is still available however
> > > AWS advices users to use the Application one. [1]
> > > While my pull request gets reviewed I will also have a look at
> > > the IGNITE-12398 [2] issue which has to do with the S3 discovery.
> > > Another idea would also be to implement a `TCP Load Balancer based`
> > > discovery.
> > >
> > > In order to test this issue and future ones I implemented some
> terraform
> > > scripts (which I shall use for other issues too) [3].
> > > If some automated e2e testing on AWS is being considered they might be
> of
> > > value.
> > > I can help on implementing those tests by provisioning the
> infrastructure
> > > in an automated way and validate the discovery.
> > >
> > > [1]
> > >
> > >
> >
> https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/migrate-to-application-load-balancer.html
> > > [2] https://issues.apache.org/jira/browse/IGNITE-12398
> > > [3] https://github.com/gkatzioura/ignite-aws-deploy
> > >
> > > Kind regards,
> > > *Emmanouil Gkatziouras*
> > > https://egkatzioura.com/ |
> > > https://www.linkedin.com/in/gkatziourasemmanouil/
> > > https://github.com/gkatzioura
> > >
> > >
> > > On Tue, 14 Jan 2020 at 22:22, Denis Magda  wrote:
> > >
> > > > Hi Emmanouil,
> > > >
> > > > Agree, let's check that the IP finder functions normally in the cloud
> > > > environment and the mock tests can be used for regular testing on
> Team
> > > > City. That's the way we tested other environment-specific IP finders
> > > > including the Kubernetes one.
> > > >
> > > > Let us know once the IP finder is tested on AWS and then we can
> proceed
> > > > with the review.
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Tue, Jan 14, 2020 at 2:47 AM Emmanouil Gkatziouras <
> > > > gkatzio...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all!
> > > > >
> > > > > With regards to the `Node Discovery Using AWS Application ELB`
> issue
> > > [1]
> >

New blog post on Apache Ignite in AWS

2020-01-22 Thread Sergey Chugunov
Hello community,

Recently I published a new blog post on getting started with Apache Ignite
in AWS [1]. I tried to make my example as simple as possible while keeping
it usable.

Let me know if this post is useful for you.

I plan to write several follow-up posts about AWS-specific things but based
on feedback may cover other topics in more detail.

Any feedback is welcome, thank you!

[1] https://www.gridgain.com/node/6247


[jira] [Created] (IGNITE-12439) More descriptive message in situation of IgniteOutOfMemoryException, warning message if risk of IOOME is found

2019-12-12 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-12439:


 Summary: More descriptive message in situation of 
IgniteOutOfMemoryException, warning message if risk of IOOME is found
 Key: IGNITE-12439
 URL: https://issues.apache.org/jira/browse/IGNITE-12439
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.9


In persistent mode starting many caches in a data region of a small size may 
lead to IgniteOutOfMemoryException being thrown.

The root cause is that each partition requires allocation of one or more 
metapages that must be stored during checkpoint and cannot be replaced by 
other types of pages.
As a result, when too many metapages occupy a significant portion of the data region's 
space, a request to replace a page in memory (with one on disk) may not be able 
to find a clean page for replacement. In this situation 
IgniteOutOfMemoryException is thrown.

It is not easy to prevent IOOME in the general case, but we should provide a more 
descriptive message when the exception is thrown and/or print out a warning to 
the logs when too many caches (or one cache with a huge number of partitions) are 
started in the same data region.
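
A rough illustration of the kind of warning this ticket suggests (the variable names and the threshold below are made up):

{code}
// Rough estimate: one metapage per partition per cache started in the region.
long metaPages = (long)cachesInRegion * partitionsPerCache;
long regionPages = dataRegionCfg.getMaxSize() / dataStorageCfg.getPageSize();

// Illustrative threshold: warn if metapages may occupy more than 25% of the region.
if (metaPages > regionPages / 4)
    log.warning("Too many partition metapages for data region [region=" +
        dataRegionCfg.getName() + ", estimatedMetaPages=" + metaPages +
        ", totalPages=" + regionPages + "]. Increase DataRegionConfiguration.maxSize " +
        "or reduce the number of caches/partitions to avoid IgniteOutOfMemoryException.");
{code}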



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: IgniteOutOfMemoryException in LOCAL cache mode with persistence enabled

2019-12-11 Thread Sergey Chugunov
Hi Mitchell,

I believe that the research done by Anton is correct, and the root cause of the
OOME is the proportion of memory occupied by metapages in the data region. Each
cache started in a data region allocates one or more metapages per
initialized partition, so when you run your test with only one cache this is
not a problem, but when a second cache is added this results in OOME.

I don't think there is an easy way to prevent this exception in general, but
I agree that we need to provide a more descriptive error message and/or an early
warning for the user that the configuration of caches and data regions may lead
to such an exception. I'll file a ticket for this improvement soon.

Best regards,
Sergey

On Thu, Dec 12, 2019 at 1:27 AM Denis Magda  wrote:

> I tend to agree with Mitchell that the cluster should not crash. If the
> crash is unavoidable based on the current architecture then a message
> should be more descriptive.
>
> Ignite persistence experts, could you please join the conversation and
> shed more light to the reported behavior?
>
> -
> Denis
>
>
> On Wed, Dec 11, 2019 at 3:25 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
> mrathb...@bloomberg.net> wrote:
>
>> 2 GB is not reasonable for off heap memory for our use case. In general,
>> even if off-heap is very low, performance should just degrade and calls
>> should become blocking, I don't think that we should crash. Either way, the
>> issue seems to be with putAll, not concurrent updates of different caches
>> in the same data region. If I use Ignite's DataStreamer API instead of
>> putAll, I get much better performance and no OOM exception. Any insight
>> into why this might be would be appreciated.
>>
>> From: u...@ignite.apache.org At: 12/10/19 11:24:35
>> To: Mitchell Rathbun (BLOOMBERG/ 731 LEX ) ,
>> u...@ignite.apache.org
>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with
>> persistence enabled
>>
>> Hello!
>>
>> 10M is very very low-ball for testing performance of disk, considering
>> how Ignite's wal/checkpoints are structured. As already told, it does not
>> even work properly.
>>
>> I recommend using 2G value instead. Just load enough data so that you can
>> observe constant checkpoints.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> ср, 4 дек. 2019 г. в 03:16, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
>> mrathb...@bloomberg.net>:
>>
>>> For the requested full ignite log, where would this be found if we are
>>> running using local mode? We are not explicitly running a separate ignite
>>> node, and our WorkDirectory does not seem to have any logs
>>>
>>> From: u...@ignite.apache.org At: 12/03/19 19:00:18
>>> To: u...@ignite.apache.org
>>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with
>>> persistence enabled
>>>
>>> For our configuration properties, our DataRegion initialSize and MaxSize
>>> was set to 11 MB and persistence was enabled. For DataStorage, our pageSize
>>> was set to 8192 instead of 4096. For Cache, write behind is disabled, on
>>> heap cache is disabled, and Atomicity Mode is Atomic
>>>
>>> From: u...@ignite.apache.org At: 12/03/19 13:40:32
>>> To: u...@ignite.apache.org
>>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with
>>> persistence enabled
>>>
>>> Hi Mitchell,
>>>
>>> Looks like it could be easily reproduced on low off-heap sizes, I tried
>>> with
>>> simple puts and got the same exception:
>>>
>>> class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Failed
>>> to
>>> find a page for eviction [segmentCapacity=1580, loaded=619,
>>> maxDirtyPages=465, dirtyPages=619, cpPages=0, pinnedInSegment=0,
>>> failedToPrepare=620]
>>> Out of memory in data region [name=Default_Region, initSize=10.0 MiB,
>>> maxSize=10.0 MiB, persistenceEnabled=true] Try the following:
>>> ^-- Increase maximum off-heap memory size
>>> (DataRegionConfiguration.maxSize)
>>> ^-- Enable Ignite persistence
>>> (DataRegionConfiguration.persistenceEnabled)
>>> ^-- Enable eviction or expiration policies
>>>
>>> It looks like Ignite must issue a proper warning in this case and couple
>>> of
>>> issues must be filed against Ignite JIRA.
>>>
>>> Check out this article on persistent store available in Ignite
>>> confluence as
>>> well:
>>>
>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+und
>>> er+the+hood#IgnitePersistentStore-underthehood-Checkpointing
>>>
>>> I've managed to make kind of similar example working with 20 Mb region
>>> with
>>> a bit of tuning, added following properties to
>>> org.apache.ignite.configuration.DataStorageConfiguration:
>>>
>>> The whole idea behind this is to trigger checkpoint on timeout rather
>>> than
>>> on too much dirty pages percentage threshold. The checkpoint page buffer
>>> size may not exceed data region size, which is 10 Mb, which might be
>>> overflown during checkpoint as well.
>>>
>>> I assume that checkpoint is never triggered in this case because of
>>> per-partition overhead: Ignite writes some meta per partition and it

Contribution to Apache Ignite

2019-11-20 Thread Sergey Chugunov
Hello Lev,

My name is Sergey, I'm from the Apache Ignite community. As I can see, you
successfully completed your first ticket, but there are some review comments
on your second one [1].

Do you need any assistance with resolving them?

Also, if you're interested in more challenging tasks, there are plenty of
them and we could figure out what to pick up next.

Anyway, thank you for your interest in our project and community!

[1] https://issues.apache.org/jira/browse/IGNITE-11312

--
Best regards,
Sergey Chugunov.


Re: Binary object format KB article

2019-10-16 Thread Sergey Chugunov
Then I would suggest defining good terminology at the very beginning of
the article.

Right in the introduction section I see a lot of terms like "Binary object
format", "Binary object container format" (is it the same thing?), "Binary
serialization format". In the next section "binary type" pops up. What are
the relations between them?

The schemes part needs more examples. What is a scheme? How is it related to
a binary type? Is it a one-to-one relationship? One-to-many? When is a new
scheme created? Why should a type and a scheme be registered on the receiver
side? And if there is a receiver, then who is the sender?

It seems to me that the document tries to focus on details of the format itself,
but other aspects of this functionality leak into the explanation and
confuse the reader.

On Wed, Oct 16, 2019 at 2:52 PM Ivan Pavlukhin  wrote:

> Pavel, Sergey,
>
> Thank you for your feedback!
>
> To be exact the document does not describe broad picture (including
> metadata exchange) and is not a formal format specification
> intentionally. I wanted to create a lightweight article giving an
> intuition about binary object structure to a reader. And yes,
> intuition about metadata registration is definitely an important,
> related but slightly different subject.
>
> ср, 16 окт. 2019 г. в 14:23, Sergey Chugunov :
> >
> > Ivan, thank you for documenting this functionality, agree with Pavel
> here.
> >
> > I think this document is a good starting point and contains a lot of
> > low-level details and great examples but from my perspective it doesn't
> > show how binary objects fit into a broader picture.
> >
> > It worth adding higher-level details and restructure the document into a
> > top-down article starting from where binary format is used
> (representation
> > of objects in cache, binary protocol for communications with thin
> clients)
> > and down to lower details like binary metadata exchange and serialization
> > and container formats.
> >
> > Another option would be to leave the document focused on a low-level
> > details as it is now but build around it drafts for documents describing
> > other aspects of Binary Objects.
> > This will make our documentation much more solid and useful for readers.
> >
> > On Wed, Oct 16, 2019 at 2:07 PM Pavel Tupitsyn 
> wrote:
> >
> > > Ivan, great job, thanks for putting this together.
> > >
> > > I think we also need a more formal description of the format, including
> > > binary metadata exchange mechanics.
> > > It was done (partially) for IEP-9 Thin Client Protocol, we should
> probably
> > > copy from there:
> > >
> > >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-9+Thin+Client+Protocol#IEP-9ThinClientProtocol-BinaryObjects
> > >
> > >
> > >
> > > On Wed, Oct 16, 2019 at 11:49 AM Ivan Pavlukhin 
> > > wrote:
> > >
> > > > Igniters,
> > > >
> > > > I published a document about Binary format in cwiki [1]. Please share
> > > > your feedback. I feel that there is a lack of pictures on the page.
> > > > Need to figure out what aspects will be more clear with pictures.
> > > >
> > > > [1]
> > > >
> https://cwiki.apache.org/confluence/display/IGNITE/Binary+object+format
> > > >
> > > > --
> > > > Best regards,
> > > > Ivan Pavlukhin
> > > >
> > >
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>


Re: Binary object format KB article

2019-10-16 Thread Sergey Chugunov
Ivan, thank you for documenting this functionality, agree with Pavel here.

I think this document is a good starting point and contains a lot of
low-level details and great examples, but from my perspective it doesn't
show how binary objects fit into a broader picture.

It is worth adding higher-level details and restructuring the document into a
top-down article, starting from where the binary format is used (representation
of objects in cache, binary protocol for communication with thin clients)
and going down to lower-level details like binary metadata exchange and the
serialization and container formats.

Another option would be to leave the document focused on low-level
details as it is now but build around it drafts for documents describing
other aspects of Binary Objects.
This will make our documentation much more solid and useful for readers.

On Wed, Oct 16, 2019 at 2:07 PM Pavel Tupitsyn  wrote:

> Ivan, great job, thanks for putting this together.
>
> I think we also need a more formal description of the format, including
> binary metadata exchange mechanics.
> It was done (partially) for IEP-9 Thin Client Protocol, we should probably
> copy from there:
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-9+Thin+Client+Protocol#IEP-9ThinClientProtocol-BinaryObjects
>
>
>
> On Wed, Oct 16, 2019 at 11:49 AM Ivan Pavlukhin 
> wrote:
>
> > Igniters,
> >
> > I published a document about Binary format in cwiki [1]. Please share
> > your feedback. I feel that there is a lack of pictures on the page.
> > Need to figure out what aspects will be more clear with pictures.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/Binary+object+format
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >
>


Cluster ID and tag: identification of the cluster

2019-08-28 Thread Sergey Chugunov
Hello folks,

I would like to propose implementing new properties to identify the cluster
and simplify managing it by external tools (e.g. custom scripts built on
top of standard Control Utility).

These properties are Cluster ID (UUID) and Cluster tag (String) exposed
through existing IgniteCluster public API.

Both properties are generated upon cluster startup (before activation if
the cluster requires it); they survive restarts if the cluster is configured in
persistent mode and are regenerated if it is an in-memory cluster.

Cluster ID is immutable and more useful for automated tools, while Cluster
tag is mutable (may be changed by the user) and is supposed to be human
readable for viewing in GUI or web-based management solutions.
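
To illustrate how this could look from user code, here is a tiny sketch of the proposed API (the method names are assumptions based on this proposal, and `ignite` is an already started node):

    IgniteCluster cluster = ignite.cluster();

    UUID id = cluster.id();            // immutable, generated on first cluster startup
    String tag = cluster.tag();        // human-readable, auto-generated by default

    cluster.tag("production-eu-west"); // an operator gives the cluster a meaningful name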

I already created a ticket [1] with some more technical details and invite
community to discuss this feature.

[1] https://issues.apache.org/jira/browse/IGNITE-12111
--
Best Regards,
Sergey Chugunov.


[jira] [Created] (IGNITE-12111) Cluster ID and tag: properties to identify the cluster

2019-08-27 Thread Sergey Chugunov (Jira)
Sergey Chugunov created IGNITE-12111:


 Summary: Cluster ID and tag: properties to identify the cluster
 Key: IGNITE-12111
 URL: https://issues.apache.org/jira/browse/IGNITE-12111
 Project: Ignite
  Issue Type: New Feature
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


To improve cluster management capabilities two new properties of the cluster 
are introduced:

# A unique cluster ID (may be either a random UUID or random IgniteUUID). 
Generated upon cluster start and saved to the distributed metastorage. 
Immutable. Persistent clusters must persist the value. In-memory clusters 
should keep the generated ID in memory and regenerate it upon restart.
# Human-readable cluster tag. Generated by default as a random (but still 
meaningful) string; may be changed by the user. It also survives restarts in 
persistent clusters and is regenerated in in-memory clusters on every restart.

These properties are exposed to standard APIs:
# EVT_CLUSTER_TAG_CHANGED event generated when tag is changed by user;
# JMX bean and control utility APIs to view ID and tag and change tag.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Asynchronous registration of binary metadata

2019-08-23 Thread Sergey Chugunov
ested.
> We
> > do
> > > > not
> > > > > need another copy/paste code.
> > > > >
> > > > > Another possibility is to carry metadata along with appropriate
> > request
> > > > if
> > > > > it's not found locally but this is a rather big modification.
> > > > >
> > > > >
> > > > >
> > > > > вт, 20 авг. 2019 г. в 17:26, Denis Mekhanikov <
> dmekhani...@gmail.com
> > >:
> > > > >
> > > > > > Eduard,
> > > > > >
> > > > > > Usages will wait for the metadata to be registered and written to
> > disk.
> > > > No
> > > > > > races should occur with such flow.
> > > > > > Or do you have some specific case on your mind?
> > > > > >
> > > > > > I agree, that using a distributed meta storage would be nice
> here.
> > > > > > But this way we will kind of move to the previous scheme with a
> > > > replicated
> > > > > > system cache, where metadata was stored before.
> > > > > > Will scheme with the metastorage be different in any way? Won’t
> we
> > > > decide
> > > > > > to move back to discovery messages again after a while?
> > > > > >
> > > > > > Denis
> > > > > >
> > > > > >
> > > > > > > On 20 Aug 2019, at 15:13, Eduard Shangareev <
> > > > eduard.shangar...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Denis,
> > > > > > > How would we deal with races between registration and metadata
> > usages
> > > > > > with
> > > > > > > such fast-fix?
> > > > > > >
> > > > > > > I believe, that we need to move it to distributed metastorage,
> > and
> > > > await
> > > > > > > registration completeness if we can't find it (wait for work in
> > > > > > progress).
> > > > > > > Discovery shouldn't wait for anything here.
> > > > > > >
> > > > > > > On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <
> > > > dmekhani...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Sergey,
> > > > > > > >
> > > > > > > > Currently metadata is written to disk sequentially on every
> > node. Only
> > > > > > one
> > > > > > > > node at a time is able to write metadata to its storage.
> > > > > > > > Slowness accumulates when you add more nodes. A delay
> required
> > to
> > > > write
> > > > > > > > one piece of metadata may be not that big, but if you
> multiply
> > it by
> > > > say
> > > > > > > > 200, then it becomes noticeable.
> > > > > > > > But If we move the writing out from discovery threads, then
> > nodes will
> > > > > > be
> > > > > > > > doing it in parallel.
> > > > > > > >
> > > > > > > > I think, it’s better to block some threads from a striped
> pool
> > for a
> > > > > > > > little while rather than blocking discovery for the same
> > period, but
> > > > > > > > multiplied by a number of nodes.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Denis
> > > > > > > >
> > > > > > > > > On 15 Aug 2019, at 10:26, Sergey Chugunov <
> > sergey.chugu...@gmail.com
> > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Denis,
> > > > > > > > >
> > > > > > > > > Thanks for bringing this issue up, decision to write binary
> > metadata
> > > > > > from
> > > > > > > > > discovery thread was really a tough decision to make.
> > > > > > > > > I don't think that moving metadata to metastorage is a
> > silver bullet
> > > > > > here
> > > > >

Re: Re[2]: Asynchronous registration of binary metadata

2019-08-15 Thread Sergey Chugunov
Denis,

Thanks for bringing this issue up; the decision to write binary metadata from
the discovery thread was really a tough one to make.
I don't think that moving metadata to the metastorage is a silver bullet here,
as this approach also has its drawbacks and is not an easy change.

In addition to the workarounds suggested by Alexei, we have two choices for
offloading the write operation from the discovery thread (a rough sketch of the
first one is below):

   1. Your scheme with a separate writer thread and futures completed when
   the write operation is finished.
   2. A PME-like protocol with obvious complications like failover and
   asynchronous waits for replies over the communication layer.
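
To make option 1 a bit more concrete, here is a minimal sketch (class and method names are made up, this is not the actual Ignite code): the discovery thread only enqueues the write and returns immediately, while threads that actually need the metadata wait on the corresponding future.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncBinaryMetadataWriter {
    private final ExecutorService writer =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "binary-metadata-writer"));

    /** Futures of in-progress writes keyed by binary type id. */
    private final Map<Integer, CompletableFuture<Void>> inProgress = new ConcurrentHashMap<>();

    /** Called from the discovery thread: enqueue the write and return immediately. */
    CompletableFuture<Void> writeAsync(int typeId, byte[] metadata) {
        CompletableFuture<Void> fut = new CompletableFuture<>();

        inProgress.put(typeId, fut);

        writer.execute(() -> {
            try {
                writeToDisk(typeId, metadata); // fsync happens here, off the discovery thread

                fut.complete(null);
            }
            catch (Exception e) {
                fut.completeExceptionally(e);
            }
            finally {
                inProgress.remove(typeId, fut);
            }
        });

        return fut;
    }

    /** Called by threads that need the metadata: wait until the write is durable. */
    void awaitWritten(int typeId) {
        CompletableFuture<Void> fut = inProgress.get(typeId);

        if (fut != null)
            fut.join();
    }

    private void writeToDisk(int typeId, byte[] metadata) {
        // Placeholder for the actual file write + fsync.
    }
}

The starvation risk I describe below is exactly about many threads blocking in awaitWritten() while the single writer is slow.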

Your suggestion looks easier from a code complexity perspective, but in my
view it increases the chances of getting into starvation. Now, if some node faces
really long delays during the write op it is going to be kicked out of the topology by
the discovery protocol. In your case it is possible that more and more threads
from other pools may get stuck waiting on the operation future, which is also not
good.
What do you think?

I also think that if we want to approach this issue systematically, we need
to do a deep analysis of the metastorage option as well and finally choose
which road we want to go down.

Thanks!

On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
 wrote:

>
> >
> >> 1. Yes, only on OS failures. In such case data will be received from
> alive
> >> nodes later.
> What behavior would be in case of one node ? I suppose someone can obtain
> cache data without unmarshalling schema, what in this case would be with
> grid operability?
>
> >
> >> 2. Yes, for walmode=FSYNC writes to metastore will be slow. But such
> mode
> >> should not be used if you have more than two nodes in grid because it
> has
> >> huge impact on performance.
> Is wal mode affects metadata store ?
>
> >
> >>
> >> ср, 14 авг. 2019 г. в 14:29, Denis Mekhanikov < dmekhani...@gmail.com
> >:
> >>
> >>> Folks,
> >>>
> >>> Thanks for showing interest in this issue!
> >>>
> >>> Alexey,
> >>>
>  I think removing fsync could help to mitigate performance issues with
> >>> current implementation
> >>>
> >>> Is my understanding correct, that if we remove fsync, then discovery
> won’t
> >>> be blocked, and data will be flushed to disk in background, and loss of
> >>> information will be possible only on OS failure? It sounds like an
> >>> acceptable workaround to me.
> >>>
> >>> Will moving metadata to metastore actually resolve this issue? Please
> >>> correct me if I’m wrong, but we will still need to write the
> information to
> >>> WAL before releasing the discovery thread. If WAL mode is FSYNC, then
> the
> >>> issue will still be there. Or is it planned to abandon the
> discovery-based
> >>> protocol at all?
> >>>
> >>> Evgeniy, Ivan,
> >>>
> >>> In my particular case the data wasn’t too big. It was a slow
> virtualised
> >>> disk with encryption, that made operations slow. Given that there are
> 200
> >>> nodes in a cluster, where every node writes slowly, and this process is
> >>> sequential, one piece of metadata is registered extremely slowly.
> >>>
> >>> Ivan, answering to your other questions:
> >>>
>  2. Do we need a persistent metadata for in-memory caches? Or is it so
> >>> accidentally?
> >>>
> >>> It should be checked, if it’s safe to stop writing marshaller mappings
> to
> >>> disk without loosing any guarantees.
> >>> But anyway, I would like to have a property, that would control this.
> If
> >>> metadata registration is slow, then initial cluster warmup may take a
> >>> while. So, if we preserve metadata on disk, then we will need to warm
> it up
> >>> only once, and further restarts won’t be affected.
> >>>
>  Do we really need a fast fix here?
> >>>
> >>> I would like a fix, that could be implemented now, since the activity
> with
> >>> moving metadata to metastore doesn’t sound like a quick one. Having a
> >>> temporary solution would be nice.
> >>>
> >>> Denis
> >>>
>  On 14 Aug 2019, at 11:53, Павлухин Иван < vololo...@gmail.com >
> wrote:
> 
>  Denis,
> 
>  Several clarifying questions:
>  1. Do you have an idea why metadata registration takes so long? So
>  poor disks? So many data to write? A contention with disk writes by
>  other subsystems?
>  2. Do we need a persistent metadata for in-memory caches? Or is it so
>  accidentally?
> 
>  Generally, I think that it is possible to move metadata saving
>  operations out of discovery thread without loosing required
>  consistency/integrity.
> 
>  As Alex mentioned using metastore looks like a better solution. Do we
>  really need a fast fix here? (Are we talking about fast fix?)
> 
>  ср, 14 авг. 2019 г. в 11:45, Zhenya Stanilovsky
> >>> < arzamas...@mail.ru.invalid >:
> >
> > Alexey, but in this case customer need to be informed, that whole
> (for
> >>> example 1 node) cluster crash (power off) could lead to partial data
> >>> unavailability.
> > And may be further index corruption.

[jira] [Created] (IGNITE-11952) Bug fixes and improvements in console utilities & test fixes

2019-07-02 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11952:


 Summary: Bug fixes and improvements in console utilities & test 
fixes
 Key: IGNITE-11952
 URL: https://issues.apache.org/jira/browse/IGNITE-11952
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11865) FailureProcessor treats tcp-comm-worker as blocked when it works on reestablishing connect to failed client node

2019-05-22 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11865:


 Summary: FailureProcessor treats tcp-comm-worker as blocked when 
it works on reestablishing connect to failed client node
 Key: IGNITE-11865
 URL: https://issues.apache.org/jira/browse/IGNITE-11865
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


When a client node fails, the tcp-comm-worker thread on the server keeps trying to 
reestablish the connection to the client until the failed node is removed from topology 
(on expiration of clientFailureDetectionTimeout).

As the tcp-comm-worker thread doesn't update its heartbeats from its internal loops, 
FailureProcessor considers it blocked and prints out a misleading message to the 
logs along with a full thread dump.

To avoid polluting the logs with unnecessary messages we need to teach 
tcp-comm-worker to update its heartbeat timestamp in FailureProcessor.
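
For illustration, the reconnect loop could report liveness roughly like this (a simplified sketch; tryReconnect, clientNode and connected are placeholders, and the heartbeat call is assumed to be the worker's existing liveness hook):

{code}
while (!isCancelled() && !connected) {
    // Let the failure detection machinery know the worker is alive, even though
    // the reconnect attempts to the failed client below may take a long time.
    updateHeartbeat();

    connected = tryReconnect(clientNode);
}
{code}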



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11743) Stopping caches concurrently with node join may lead to crash of the node

2019-04-15 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11743:


 Summary: Stopping caches concurrently with node join may lead to 
crash of the node
 Key: IGNITE-11743
 URL: https://issues.apache.org/jira/browse/IGNITE-11743
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Attachments: IgnitePdsNodeRestartCacheCreateTest.java

When an existing cache is stopped (e.g. via a call to Ignite#destroyCache(String 
name)) this action is distributed across the cluster by the discovery mechanism (and is 
processed from the *disco-notifier-worker* thread).
At the same time a joining node prepares to start caches from the *exchange-thread*.

If a cache stop request arrives at the new node right in the middle of cache start 
preparation, it may lead to an exception in FilePageStoreManager like the one below and 
a node crash.

A test reproducing the issue is attached.

{noformat}
class org.apache.ignite.IgniteCheckedException: Failed to get page store for 
the given cache ID (cache has not been started): -1422502786
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.getStore(FilePageStoreManager.java:1132)
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:482)
at 
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:469)
at 
org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:854)
at 
org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:681)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.getOrAllocateCacheMetas(GridCacheOffheapManager.java:869)
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.initDataStructures(GridCacheOffheapManager.java:128)
at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.start(IgniteCacheOffheapManagerImpl.java:193)
at 
org.apache.ignite.internal.processors.cache.CacheGroupContext.start(CacheGroupContext.java:1043)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2829)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2557)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:2387)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$null$6a5b31b9$1(GridCacheProcessor.java:2209)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$5(GridCacheProcessor.java:2130)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$926b6886$1(GridCacheProcessor.java:2206)
at 
org.apache.ignite.internal.util.IgniteUtils.lambda$null$1(IgniteUtils.java:10874)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11739) Refactoring of cache lifecycle and cache configuration management code

2019-04-12 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11739:


 Summary: Refactoring of cache lifecycle and cache configuration 
management code
 Key: IGNITE-11739
 URL: https://issues.apache.org/jira/browse/IGNITE-11739
 Project: Ignite
  Issue Type: Task
  Components: cache
Reporter: Sergey Chugunov


h2. Problem
Currently the code responsible for cache lifecycle and configuration management is 
spread across different entities (e.g. GridCacheProcessor, 
GridCacheAffinityManager, ClusterCachesInfo and so on).
Cache configuration data is duplicated multiple times and presented in 
different forms, from StoredCacheData to CacheGroupDescriptor to 
DynamicCacheDescriptor to ClusterCachesInfo.

Altogether there is no entity or abstraction which contains most of the logic 
of managing cache state and config and provides a clean API for this purpose.

All this makes it hard to maintain the code, fix bugs and make improvements, so 
the need for refactoring and the benefits from it are obvious.

h2. Approaches
# Build a state machine manipulating immutable state objects to reflect 
transitions between states (a small sketch is given below).
# Concentrate all cache-related info into a new abstraction (like a cache 
container) or an existing mutable entity (e.g. cache context) and manipulate that 
entity to reflect the evolution of the cache.
# Some mix of these two approaches.

There are already plenty of entities like CacheInfo or CacheDescriptor with 
names suggesting they contain information about caches. 
The problem though is that each of these entities manages only some part of 
that information.

Regardless of which approach is used, a clear and well documented API should be 
provided for managing lifecycle and configuration.
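
For illustration only, approach 1 could be built around something like the following (the names and transitions are made up, this is not actual Ignite code):

{code}
/** Illustrative sketch of approach 1: immutable state objects plus explicit transitions. */
enum CacheStatus { REGISTERED, STARTING, STARTED, STOPPING, STOPPED }

final class CacheState {
    final String cacheName;
    final CacheStatus status;

    CacheState(String cacheName, CacheStatus status) {
        this.cacheName = cacheName;
        this.status = status;
    }

    /** Every transition produces a new immutable state instead of mutating descriptors in place. */
    CacheState transitionTo(CacheStatus newStatus) {
        if (!isAllowed(status, newStatus))
            throw new IllegalStateException("Illegal transition: " + status + " -> " + newStatus);

        return new CacheState(cacheName, newStatus);
    }

    private static boolean isAllowed(CacheStatus from, CacheStatus to) {
        switch (from) {
            case REGISTERED: return to == CacheStatus.STARTING;
            case STARTING:   return to == CacheStatus.STARTED || to == CacheStatus.STOPPING;
            case STARTED:    return to == CacheStatus.STOPPING;
            case STOPPING:   return to == CacheStatus.STOPPED;
            default:         return false;
        }
    }
}
{code}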



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11621) Node is stuck in "No next node in topology" infinite loop in special case.

2019-03-25 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11621:


 Summary: Node is stuck in "No next node in topology" infinite loop 
in special case.
 Key: IGNITE-11621
 URL: https://issues.apache.org/jira/browse/IGNITE-11621
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Attachments: NoNextNodeInTopologyReproducer.java

In a special case (reproducer is attached) a node may get stuck in the loop when the 
following sequence of events happens:
* Nodes A and B are already in the cluster.
* Node C starts joining the cluster.
* While the NodeAdded message for node C is being processed, a new node D is started.
* Before NodeAddFinished for node C reaches it, the socket to node C fails and the node 
is considered failed by the cluster.
* When the NodeFailed message for node C reaches node B, both A and B fail.
* After that node D gets stuck in an infinite "No next node in topology" loop 
processing NodeFailed messages for A, B and C indefinitely.

The main logic in the attached reproducer lives in node1SpecialSpi - it is the 
TcpDiscoverySpi node B starts with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11493) Test CheckpointFreeListTest#testFreeListRestoredCorrectly always fails in DiskCompression suite

2019-03-06 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11493:


 Summary: Test CheckpointFreeListTest#testFreeListRestoredCorrectly 
always fails in DiskCompression suite
 Key: IGNITE-11493
 URL: https://issues.apache.org/jira/browse/IGNITE-11493
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov


Test fails with the following NullPointerException in logs:
{code}
[2019-03-06 
16:05:24,353][ERROR][exchange-worker-#94%client%][IgniteTestResources] Critical 
system error detected. Will be handled accordingly to configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteCheckedException: null]]
class org.apache.ignite.IgniteCheckedException: null
at 
org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7323)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:209)
at 
org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:160)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2948)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2769)
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.ignite.internal.processors.cache.CacheCompressionManager.start0(CacheCompressionManager.java:55)
at 
org.apache.ignite.internal.processors.cache.GridCacheManagerAdapter.start(GridCacheManagerAdapter.java:50)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.initCacheContext(GridCacheProcessor.java:2534)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:2344)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:2270)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$55a0e703$1(GridCacheProcessor.java:2141)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$5(GridCacheProcessor.java:2094)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:2138)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:2093)
at 
org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCachesOnLocalJoin(GridCacheProcessor.java:2039)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:951)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:810)
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2920)
... 3 more
{code}

The root cause is that CacheManager, when initializing the CacheContext on a client, 
tries to start GridCompressionManager, which doesn't make sense on a client node.

We should either exclude the compression manager from the cache context on clients or 
not start it during the initialization phase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11459) Possible dead code in TcpDiscoveryStatusCheckMessage flow

2019-02-28 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11459:


 Summary: Possible dead code in TcpDiscoveryStatusCheckMessage flow
 Key: IGNITE-11459
 URL: https://issues.apache.org/jira/browse/IGNITE-11459
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov


Working on IGNITE-11364 I found the following suspicious detail about the 
StatusCheck flow: in the message there is a special field {{failedNodeId}} 
which seems to duplicate the functionality of the {{failedNodes}} collection in 
TcpDiscoveryAbstractMessage.

The {{failedNodeId}} field is filled only in a special scenario of a failed ping of a 
remote node. It is used *only* to ignore the message.

A historical overview of this field revealed commit *838c0fd* where a meaningful 
piece of code was either intentionally removed or accidentally lost:
{noformat}
if (msg instanceof TcpDiscoveryStatusCheckMessage) {
TcpDiscoveryStatusCheckMessage msg0 = 
(TcpDiscoveryStatusCheckMessage)msg;

if (next.id().equals(msg0.failedNodeId())) {
next = null;

if (log.isDebugEnabled())
log.debug("Discarding status check since next 
node has indeed failed [next=" + next +
", msg=" + msg + ']');

// Discard status check message by exiting loop and 
handle failure.
break;
}
}
{noformat}

Conclusion: the field {{failedNodeId}} and the whole flow around it look 
suspicious and have to be reviewed for flaws. The review should result in either a 
redesign of the flow or deletion of the code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11348) Ping node procedure may fail when another node leaves the cluster

2019-02-18 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11348:


 Summary: Ping node procedure may fail when another node leaves the 
cluster
 Key: IGNITE-11348
 URL: https://issues.apache.org/jira/browse/IGNITE-11348
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


Additional pinging of a node on join implemented in IGNITE-5569 may incorrectly 
fail, leading to shutdown of the joining node.

The reason for this is that if another node from the same host, bound to the 
same discovery port as the joining node, has left the cluster right before the joining 
node, the socket used for pinging gets closed.
This leads to the situation when the pinging node considers the joining node 
"unreachable" and fails it with the JOIN_IMPOSSIBLE error code.

Workaround: just start the node that failed on join again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11290) History of server node IDs should be passed to new nodes with NodeAddedMessage

2019-02-11 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11290:


 Summary: History of server node IDs should be passed to new nodes 
with NodeAddedMessage
 Key: IGNITE-11290
 URL: https://issues.apache.org/jira/browse/IGNITE-11290
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


As part of IGNITE-5569 a (bounded) history of IDs of all server nodes that ever existed in 
the cluster is introduced to prevent join requests with duplicate IDs if a 
network glitch happens during a node's join process.

The initial implementation maintains the history locally on each node, and the history isn't 
transferred to successfully joined nodes.

It needs to be passed (in NodeAdded messages) to new nodes to cover edge-case 
scenarios of coordinator failover.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11159) Collections of 'start-on-join' caches and 'init-caches' should be filtered

2019-01-31 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11159:


 Summary: Collections of 'start-on-join' caches and 'init-caches' 
should be filtered
 Key: IGNITE-11159
 URL: https://issues.apache.org/jira/browse/IGNITE-11159
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Baseline auto-adjust`s discuss

2019-01-25 Thread Sergey Chugunov
Anton,

As I understand from the IEP document, the policy was supposed to support two
timeouts, soft and hard, so here you're proposing somewhat simpler
functionality.

Just to clarify, do I understand correctly that this feature, when enabled,
will auto-adjust the baseline topology on each node join/left event, and the
timeout is only there to protect us from blinking nodes?
So there are no complexities such as taking into account the number of alive
backups or anything like that?

On Fri, Jan 25, 2019 at 1:11 PM Vladimir Ozerov 
wrote:

> Got it, makes sense.
>
> On Fri, Jan 25, 2019 at 11:06 AM Anton Kalashnikov 
> wrote:
>
> > Vladimir, thanks for your notes, both of them look good enough, but I
> > have two different thoughts about them.
> >
> > I think I agree about enabling only one of manual/auto adjustment. It is
> > easier than the current solution and, in fact, as an extra feature we can
> > allow the user to force the task to execute (if they don't want to wait
> > until the timeout expires).
> > But about the second one, I am not sure that one parameter instead of two
> > would be more convenient. For example: if a user changed the timeout and
> > then disabled auto-adjust, whoever wants to enable it later has to know
> > what the timeout value was before auto-adjust was disabled. I think the
> > "negative value" pattern is a good choice for always-applicable parameters
> > like a connection timeout (e.g. -1 means endless waiting) and so on, but
> > in our case we want to disable the whole functionality rather than change
> > a parameter value.
> >
> > --
> > Best regards,
> > Anton Kalashnikov
> >
> >
> > 24.01.2019, 22:03, "Vladimir Ozerov" :
> > > Hi Anton,
> > >
> > > This is a great feature, but I am a bit confused about the automatic
> > > disabling of the feature during manual baseline adjustment. This may lead
> > > to unpleasant situations when a user enabled auto-adjustment, then
> > > re-adjusted it manually somehow (e.g. from some previously created script)
> > > so that the disabling of auto-adjustment went unnoticed, then added more
> > > nodes hoping that auto-baseline is still active, etc.
> > >
> > > Instead, I would rather make manual and auto adjustment mutually exclusive
> > > - the baseline cannot be adjusted manually when auto mode is set, and vice
> > > versa. If an exception is thrown in those cases, administrators will
> > > always know the current behavior of the system.
> > >
> > > As far as configuration goes, wouldn't it be enough to have a single long
> > > value as opposed to Boolean + long? Say, 0 - immediate auto adjustment,
> > > negative - disabled, positive - auto adjustment after the timeout.
> > >
> > > Thoughts?
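
A minimal sketch of how such a single long value could be interpreted (the class and
method names here are hypothetical, not actual Ignite API):

{noformat}
/** Illustrative only: interprets one long value as the whole auto-adjust policy. */
final class BaselineAutoAdjustSetting {
    /** Negative: disabled; 0: adjust on the next topology event; positive: adjust after this delay (ms). */
    private final long timeout;

    BaselineAutoAdjustSetting(long timeout) {
        this.timeout = timeout;
    }

    boolean autoAdjustEnabled() {
        return timeout >= 0;
    }

    long autoAdjustDelayMs() {
        if (timeout < 0)
            throw new IllegalStateException("Baseline auto-adjust is disabled");

        return timeout; // 0 means "adjust immediately after a join/left event".
    }
}
{noformat}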
> > >
> > >> Thu, Jan 24, 2019 at 18:33, Anton Kalashnikov :
> > >
> > >>  Hello, Igniters!
> > >>
> > >>  Work on Phase II of IEP-4 (Baseline topology) [1] has started. I want
> > >>  to start a discussion of the implementation of "Baseline auto-adjust" [2].
> > >>
> > >>  The "Baseline auto-adjust" feature implements a mechanism that auto-adjusts
> > >>  the baseline to the current topology after a node join/left event has
> > >>  appeared. It is required because when a node leaves the grid and nobody
> > >>  changes the baseline manually, it can lead to lost data (when more nodes
> > >>  leave the grid, depending on the backup factor), but permanent tracking of
> > >>  the grid is not always possible/desirable. It looks like in many cases
> > >>  auto-adjusting the baseline after some timeout is very helpful.
> > >>
> > >>  Distributed metastore [3] (it is already done):
> > >>
> > >>  First of all, we need the ability to store configuration data
> > >>  consistently and cluster-wide. Ignite doesn't have any specific API for
> > >>  such configurations, and we don't want to have many similar
> > >>  implementations of the same feature in our code. After some thought it was
> > >>  proposed to implement it as a kind of distributed metastorage that gives
> > >>  the ability to store any data in it.
> > >>  The first implementation is based on the existing local metastorage API
> > >>  for persistent clusters (in-memory clusters will store data in memory).
> > >>  Write/remove operations use the Discovery SPI to send updates to the
> > >>  cluster, which guarantees update ordering and the fact that all existing
> > >>  (alive) nodes have handled the update message. As a way to find out which
> > >>  node has the latest data there is a "version" value of the distributed
> > >>  metastorage, which is basically . All update history
> > >>  until some point in the past is stored along with the data, so when an
> > >>  outdated node connects to the cluster it will receive all the missing data
> > >>  and apply it locally. If there's not enough history stored, or the joining
> > >>  node is clean, then it'll receive a snapshot of the distributed
> > >>  metastorage, so there won't be inconsistencies.
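
A minimal sketch of the version-plus-bounded-history catch-up described above (the data
structures and names are illustrative, not the actual implementation):

{noformat}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: versioned key/value store with bounded update history for node catch-up. */
final class MetastorageSketch {
    private final Map<String, String> data = new HashMap<>();
    private final Deque<String[]> history = new ArrayDeque<>(); // {key, value} pairs, oldest first.
    private final int maxHistory;
    private long ver; // Incremented on every update delivered in discovery order.

    MetastorageSketch(int maxHistory) {
        this.maxHistory = maxHistory;
    }

    /** Applies an update that arrived through the discovery layer (totally ordered). */
    void apply(String key, String val) {
        data.put(key, val);
        history.addLast(new String[] {key, val});

        if (history.size() > maxHistory)
            history.removeFirst(); // History is bounded; a too-old joiner gets a full snapshot instead.

        ver++;
    }

    /** Brings an outdated copy up to date: replays the missing history tail, or falls back to a snapshot. */
    void catchUp(MetastorageSketch outdated) {
        long missing = ver - outdated.ver; // Assumes this copy is at least as new as the outdated one.

        if (missing <= history.size())
            history.stream().skip(history.size() - missing).forEach(e -> outdated.apply(e[0], e[1]));
        else {
            outdated.data.clear();
            outdated.data.putAll(data); // Snapshot transfer, so there can be no inconsistencies.
            outdated.ver = ver;
        }
    }
}
{noformat}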
> > >>
> > >>  Baseline auto-adjust:
> > >>
> > >>  Main scenario:
> > >>  - There is grid with the baseline 

[jira] [Created] (IGNITE-11011) Initialize components with grid disco data when NodeAddFinished message is received

2019-01-21 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-11011:


 Summary: Initialize components with grid disco data when 
NodeAddFinished message is received
 Key: IGNITE-11011
 URL: https://issues.apache.org/jira/browse/IGNITE-11011
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


There is an issue when the CacheProcessor on a fresh coordinator (the very first node 
in a new topology) receives grid discovery data from another cluster that died 
before this node joined its topology but after it had sent the NodeAdded message.

IGNITE-10878 fixes it by filtering cache descriptors and cache groups in 
GridCacheProcessor, which is not a generic solution.

To fix the issue in a truly generic way, a node should initialize its components 
(including the cache processor) not on receiving the NodeAdded message but on 
receiving the NodeAddFinished message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10819) Test IgniteClientRejoinTest.testClientsReconnectAfterStart became flaky in master recently

2018-12-25 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-10819:


 Summary: Test 
IgniteClientRejoinTest.testClientsReconnectAfterStart became flaky in master 
recently
 Key: IGNITE-10819
 URL: https://issues.apache.org/jira/browse/IGNITE-10819
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
 Fix For: 2.8


As the [test 
history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-21180267941031641=testDetails_IgniteTests24Java8=%3Cdefault%3E]
 in the master branch shows, the test has become flaky recently.

The test started failing when IGNITE-10555 was merged to master.

The reason for the failure is a timeout: the *client4* node hangs waiting for PME to 
complete. Communication failures are emulated in the test, and when all clients 
fail to initiate an exchange on a specific affinity topology version (major=7, 
minor=1) everything works fine.
But sometimes the *client4* node manages to finish initiating the exchange and then 
hangs forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-10809) IgniteClusterActivateDeactivateTestWithPersistence.testActivateFailover3 fails in master

2018-12-24 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-10809:


 Summary: 
IgniteClusterActivateDeactivateTestWithPersistence.testActivateFailover3 fails 
in master
 Key: IGNITE-10809
 URL: https://issues.apache.org/jira/browse/IGNITE-10809
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


The test logic involves independent activation of two sets of nodes and then joining 
them into a single cluster.

After the BaselineTopology concept was introduced in version 2.4, this action became 
prohibited in order to enforce data integrity.

The test should be refactored to take this into account.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] ignite pull request #5600: Ignite 10374

2018-12-07 Thread sergey-chugunov-1985
GitHub user sergey-chugunov-1985 opened a pull request:

https://github.com/apache/ignite/pull/5600

Ignite 10374



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gridgain/apache-ignite ignite-10374-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/5600.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5600


commit 52cf2809a759071d440719f968ab0d0040fdf23e
Author: EdShangGG 
Date:   2018-07-17T15:04:38Z

IGNITE-9013 Fail cache future when local node is stopping - Fixes #4369.

(cherry picked from commit 85b2002796fb601d7e7ce7d7320943f9323c2bdd)

commit 92dbb2197488b5c0c61182586d49254263b3a49b
Author: Evgeny Stanilovskiy 
Date:   2018-08-10T14:47:13Z

IGNITE-9231 improvement throttle implementation, unpark threads on cpBuf 
condition. - Fixes #4506.

Signed-off-by: Ivan Rakov 

commit 0e66d270a41caeedee20a93a3ad4aea95f074dce
Author: Dmitriy Govorukhin 
Date:   2018-08-13T08:59:27Z

IGNITE-9244 Partition eviction should not take all threads in system pool

(cherry picked from commit 2d63040)
Signed-off-by: Dmitriy Govorukhin 

commit fcef6b826ef1eef13c14e752e52c1d05e49ac508
Author: Alexey Kukushkin 
Date:   2018-04-26T16:31:43Z

IGNITE-8237 Ignite blocks on SecurityException in exchange-worker due to 
unauthorised on-heap cache configuration. - Fixes #3818.

Signed-off-by: dpavlov 
(cherry picked from commit 54cb262438bc83af3c4e864a7e5897b36fcd8c73)

commit 0a12c50b4755380ab1b25dceee7420d465a6bfec
Author: dpavlov 
Date:   2018-04-26T16:38:05Z

IGNITE-8237 Javadoc for method parameters added.

(cherry picked from commit ebe55e3ff84232f67a2885354e3e26426a6b16cb)

commit 35cc55405823482e51929d42b74c5e9030bb74e9
Author: Evgeny Stanilovskiy 
Date:   2018-08-10T13:21:25Z

IGNITE-8724 Fixed misleading U.warn implementation - Fixes #4145.

commit 3ce67e86bde0c12f954403c12b1b799661f5d5ae
Author: Dmitriy Govorukhin 
Date:   2018-08-14T12:10:30Z

Merge remote-tracking branch 'professional/ignite-2.5.1-master' into 
ignite-2.5.1-master

commit 8ee5db8f253a0092a2200593fa28887878cd9a15
Author: Dmitriy Govorukhin 
Date:   2018-08-10T12:32:19Z

IGNITE-9050 WAL iterator should throw an exception if segment tail is 
reached inside archive directory - Fixes #4429.

Signed-off-by: Alexey Goncharuk 

(cherry picked from commit dbf5574)

commit 88c1035404a7a5b778e74a45bd443200602f15bd
Author: Dmitriy Govorukhin 
Date:   2018-08-14T12:03:21Z

IGNITE-9260 Fixed failing test (omit check in standalone WAL iterator) - 
Fixes #4533.

Signed-off-by: Alexey Goncharuk 

(cherry picked from commit 237a99e)

commit 80c46addc69d93fd082c37547c6368ee8616ad86
Author: vd-pyatkov 
Date:   2018-08-15T11:42:42Z

IGNITE-8761 WAL fsync at rollover should be asynchronous in LOG_ONLY and 
BACKGROUND modes - Fixes #4356.

Signed-off-by: Ivan Rakov 
(cherry picked from commit 3e75f9101411f0a6bf72aee1e52b2fc3507792ab)

commit fa458b9d715e4f7e01ac2c0363ade40a0106cf29
Author: ascherbakoff 
Date:   2018-08-11T11:13:26Z

IGNITE-9246 Optimistic transactions can wait for topology future on remap 
for a long time even if timeout is set.

(cherry picked from commit 5d151063d554f23858373d642e7875b6a4f206f9)

commit eadc99c7052a10168b4072c457be070d48e25dc8
Author: Aleksei Scherbakov 
Date:   2018-08-13T10:29:29Z

IGNITE-9147 Race between tx async rollback and lock mapping on near node 
can produce hanging primary tx

(cherry picked from commit a3f9076e475f603a6c3a457b64f721fd0c24a396)

commit 3e18a523817f7175ec3131db26147e352d8a8324
Author: Sergey Kosarev 
Date:   2018-08-15T14:59:32Z

fix imports (IGNITE-9244)

commit b4e29ad9a9dcd3572366ed000395ec3ffc34a79f
Author: Ivan Daschinskiy 
Date:   2018-08-16T11:21:19Z

GG-14091 Add idle verify dump and idle verify v2 to security chain.

commit 64b6504b4273e5a661c1420aea8ecb888f953c30
Author: Denis Mekhanikov 
Date:   2018-08-16T14:33:52Z

IGNITE-9196: SQL: Fixed memory lead in mapper. This closes #4505.

(cherry picked from commit bfa192ca473c992353c8bae8ba9aa5fa359378b3)

commit b32f510e0d16adca2232f15ae70af1f15eb39940
Author: Pavel Kovalenko 
Date:   2018-08-16T15:39:49Z

IGNITE-9227 Fixed missing reply to a single message during coordinator 
failover. Fixes #4518.

(cherry picked from commit 66fcde3)

commit bc1a1c685ec66be5f6360a36f7f842e79b040412
Author: Evgenii Zhuravlev 
Date:   2018-08-10T11:23:37Z

IGNITE-5103 Fixed TcpDiscoverySpi not to ignore maxMissedHeartbeats 
property. Fixes #4446.

(cherry picked from commit 1c840f59016273e0e99c95345c3afde639ef9689)

commit d8af4076b65302ea31af461cda3fe747aea7c583
Author: Evgeny Stanilovskiy 
Date:   2018-08-15T17:28:48Z

IGNITE-8493

[GitHub] ignite pull request #5578: IGNITE_DISABLE_WAL_DURING_REBALANCING turned on b...

2018-12-05 Thread sergey-chugunov-1985
GitHub user sergey-chugunov-1985 opened a pull request:

https://github.com/apache/ignite/pull/5578

IGNITE_DISABLE_WAL_DURING_REBALANCING turned on by default, test for race 
between checkpointer and affinity change added



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gridgain/apache-ignite ignite-10505

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/5578.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5578


commit a111842d639a9b7f0e5a8d23fc6521b7dffa978e
Author: Sergey Chugunov 
Date:   2018-12-05T12:37:20Z

IGNITE-10505 test for race between checkpointer and affinity change

commit 12f311d58ba4b7464dde8abc82d06a925b855195
Author: Sergey Chugunov 
Date:   2018-12-05T12:49:12Z

IGNITE-10505 merge master




---


Re: [DISCUSSION] Design document. Rebalance caches by transferring partition files

2018-11-27 Thread Sergey Chugunov
Eduard,

This algorithm looks much simpler, but could you clarify some edge cases,
please?

If I understand correctly, when there is a continuous flow of updates to a
page already transferred to the receiver, the checkpointer will write this
page to the log file over and over again. Do you see any risk of exhausting
disk space on the sender's side?

What if some updates come after the checkpointer has stopped updating the log
file? How will these updates be transferred to the receiver and applied there?

On Tue, Nov 27, 2018 at 7:52 PM Eduard Shangareev <
eduard.shangar...@gmail.com> wrote:

> So, after some discussion, I can describe another approach to
> building a consistent partition on the fly.
>
> 1. We make a checkpoint and fix the size of the partition in OffheapManager.
> 2. After the checkpoint finishes, we start sending the partition file (without any
> lock) to the receiver, from 0 to the fixed size.
> 3. If subsequent checkpoints detect that they would override some pages of the
> file being transferred, they should write the previous state of the page to a
> dedicated file.
> So we would have a list of pages written one by one; the page id is written in the
> page itself, so we can determine the page index. Let's name this file the log.
> 4. When the transfer has finished, the checkpointer stops updating the log file. Now
> we are ready to send it to the receiver.
> 5. On the receiver side we start merging the dirty partition file with the log
> (updating it with pages from the log file).
>
> So, the advantages of this method are:
> - checkpoint-thread work can't increase by more than a factor of two;
> - checkpoint threads don't have to wait for anything;
> - in the best case, we receive the partition without any extra effort.
>
>
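
A minimal sketch of one possible reading of steps 3-5 above (names and the page layout
are illustrative, not the actual Ignite checkpointer code):

{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative only. The sender logs the previous image of every page the checkpointer
 * is about to overwrite while the partition file is in flight; the receiver then rolls
 * the dirty file back to the state of the initial checkpoint.
 */
final class PartitionCatchUpSketch {
    /** Sender side: called by the checkpointer before it overwrites a page of the file being sent. */
    static void logPreviousPageState(byte[] previousPageImage, List<byte[]> pageLog) {
        pageLog.add(previousPageImage.clone()); // The page id is stored inside the page image itself.
    }

    /** Receiver side: merge the transferred (dirty) partition with the page log. */
    static void merge(Map<Long, byte[]> dirtyPartition, List<byte[]> pageLog) {
        Map<Long, byte[]> earliestImage = new HashMap<>();

        // The first logged image of a page is its state at the initial checkpoint.
        for (byte[] page : pageLog)
            earliestImage.putIfAbsent(pageId(page), page);

        // Overwriting with that image makes every page consistent with the checkpoint.
        dirtyPartition.putAll(earliestImage);
    }

    /** Illustrative page layout: the page id is encoded in the first 8 bytes. */
    static long pageId(byte[] page) {
        long id = 0;

        for (int i = 0; i < 8; i++)
            id = (id << 8) | (page[i] & 0xFF);

        return id;
    }
}
{noformat}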
> On Mon, Nov 26, 2018 at 8:54 PM Eduard Shangareev <
> eduard.shangar...@gmail.com> wrote:
>
> > Maxim,
> >
> > I have looked through your algorithm for reading a partition consistently,
> > and I have some questions/comments.
> >
> > 1. The algorithm requires heavy synchronization between the checkpoint thread
> > and the new-approach rebalance threads,
> > because you need strong guarantees not to start writing to or reading from a
> > chunk which the counterpart has already updated or started reading.
> >
> > 2. Also, once we have started transferring a chunk, that chunk in the original
> > partition cannot be updated by checkpoint threads. They have to wait for the
> > transfer to finish.
> >
> > 3. If sending is slow and the partition is updated, then in the worst case
> > the checkpoint threads would create a whole copy of the partition.
> >
> > So, what we have:
> > - on every page write the checkpoint thread has to synchronize with
> > the new-approach rebalance threads;
> > - the checkpoint thread has to do extra work, which sometimes could be as
> > large as copying the whole partition.
> >
> >
> > On Fri, Nov 23, 2018 at 2:55 PM Ilya Kasnacheev <
> ilya.kasnach...@gmail.com>
> > wrote:
> >
> >> Hello!
> >>
> >> This proposal will also happily break my compression-with-dictionary patch,
> >> since it currently relies on only having local dictionaries.
> >>
> >> However, when you have compressed data, maybe the speed boost is even greater
> >> with your approach.
> >>
> >> Regards,
> >> --
> >> Ilya Kasnacheev
> >>
> >>
> >> Fri, Nov 23, 2018 at 13:08, Maxim Muzafarov :
> >>
> >> > Igniters,
> >> >
> >> >
> >> > I'd like to take the next step in increasing the rebalance speed of Apache
> >> > Ignite with persistence enabled. Currently, the rebalancing
> >> > procedure doesn't utilize the network and storage device throughput to
> >> > their full extent even with meaningful values of the
> >> > rebalanceThreadPoolSize property. As part of the previous discussion
> >> > `How to make rebalance faster` [1] and IEP-16 [2], Ilya proposed an
> >> > idea [3] of transferring cache partition files over the network.
> >> > From my point of view, the case where this type of rebalancing procedure
> >> > can bring the most benefit is adding a completely new node or a set of
> >> > new nodes to the cluster. Such a scenario implies a full relocation of
> >> > cache partition files to the new node. To roughly estimate the
> >> > superiority of partition file transfer over the network, the native
> >> > Linux scp/rsync commands can be used. My test environment showed the
> >> > new approach reaching 270 MB/s vs the current 40 MB/s
> >> > single-threaded rebalance speed.
> >> >
> >> >
> >> > I've prepared the design document IEP-28 [4] and accumulated all the
> >> > process details of a new rebalance approach on that page. Below you
> >> > can find the most significant details of the new rebalance procedure
> >> > and the components of Apache Ignite which are proposed to be changed.
> >> >
> >> > Any feedback is very appreciated.
> >> >
> >> >
> >> > *PROCESS OVERVIEW*
> >> >
> >> > The whole process is described in terms of rebalancing a single cache
> >> > group, and partition files would be rebalanced one by one:
> >> >
> >> > 1. The demander node sends the GridDhtPartitionDemandMessage to the
> >> > supplier node;
> >> > 2. When the supplier node receives 

Re: [VOTE] Creation dedicated list for github notifiacations

2018-11-26 Thread Sergey Chugunov
+1

Plus, this dedicated list should be properly documented in the wiki; mentioning
it in How to Contribute [1] or in Make Teamcity Green Again [2] would be a
good idea.

[1] https://cwiki.apache.org/confluence/display/IGNITE/How+to+Contribute
[2]
https://cwiki.apache.org/confluence/display/IGNITE/Make+Teamcity+Green+Again

On Tue, Nov 27, 2018 at 9:51 AM Павлухин Иван  wrote:

> +1
> Tue, Nov 27, 2018 at 09:22, Dmitrii Ryabov :
> >
> > 0
> > Tue, Nov 27, 2018 at 02:33, Alexey Kuznetsov :
> > >
> > > +1
> > > Do not forget notification from GitBox too!
> > >
> > > On Tue, Nov 27, 2018 at 2:20 AM Zhenya 
> wrote:
> > >
> > > > +1, already doing it with filters.
> > > >
> > > > > This was discussed already [1].
> > > > >
> > > > > So, I want to complete this discussion by moving GitHub notifications
> > > > > out of the dev list to a dedicated list.
> > > > >
> > > > > Please start voting.
> > > > >
> > > > > +1 - to accept this change.
> > > > > 0 - you don't care.
> > > > > -1 - to decline this change.
> > > > >
> > > > > This vote will go for 72 hours.
> > > > >
> > > > > [1]
> > > > >
> > > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Time-to-remove-automated-messages-from-the-devlist-td37484i20.html
> > > >
> > >
> > >
> > > --
> > > Alexey Kuznetsov
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>


[jira] [Created] (IGNITE-10409) ExchangeFuture should be in charge on cancelling rebalancing process

2018-11-26 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-10409:


 Summary: ExchangeFuture should be in charge on cancelling 
rebalancing process
 Key: IGNITE-10409
 URL: https://issues.apache.org/jira/browse/IGNITE-10409
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
 Fix For: 2.8


Ticket IGNITE-7165 introduced an improvement: an ongoing partition rebalancing process 
is not cancelled when a client node joins the topology. A client join event doesn't 
change the affinity distribution, so the ongoing rebalance remains valid and there is 
no need to cancel it and start it again.
The implementation was based on introducing a new method, *rebalanceRequired*, in the 
*GridCachePreloader* interface.

At the same time, the PME optimization efforts enhanced the ExchangeFuture 
functionality, so now the future itself contains all the information about whether 
affinity changed or not.

We need to rework the code changes from IGNITE-7165 and base them on the 
ExchangeFuture functionality instead of the new method in the Preloader interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] ignite pull request #5468: IGNITE-10374 if rebalance isn't restarted no need...

2018-11-21 Thread sergey-chugunov-1985
GitHub user sergey-chugunov-1985 opened a pull request:

https://github.com/apache/ignite/pull/5468

IGNITE-10374 if rebalance isn't restarted no need to disable already …

…disabled WAL

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gridgain/apache-ignite ignite-10374

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/5468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5468


commit 22d0ac9f0c9ae16bb9c1ce24c8c6f05b586e1073
Author: Sergey Chugunov 
Date:   2018-11-22T07:56:44Z

IGNITE-10374 if rebalance isn't restarted no need to disable already 
disabled WAL




---


[jira] [Created] (IGNITE-10374) Node doesn't own rebalanced partitions on rebalancing finished

2018-11-21 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-10374:


 Summary: Node doesn't own rebalanced partitions on rebalancing 
finished
 Key: IGNITE-10374
 URL: https://issues.apache.org/jira/browse/IGNITE-10374
 Project: Ignite
  Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
 Fix For: 2.8


Prerequisite: the flag *IGNITE_DISABLE_WAL_DURING_REBALANCING* is set to true 
(the default value is false).

Scenario:
* A node joins the grid and starts rebalancing all cache groups from scratch 
(e.g. all db files of the node were cleaned up during its downtime);
* One or more client nodes join the topology while rebalancing is in progress.

Expected outcome:
Rebalance finishes, the node owns all received partitions, the new affinity is applied.

Actual outcome:
Rebalance finishes, but the node doesn't own any of the received partitions and no 
affinity changes take place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Brainstorm: Make TC Run All faster

2018-11-15 Thread Sergey Chugunov
Dmitriy,

You brought up a really important topic that has a great impact on our
project. Faster RunAlls mean quicker feedback and faster progress on issues
and features.

We have a pretty big code base of tests, about 50 thousand tests. Do we
have any idea of how these tests overlap with each other? It seems quite
possible that we have a good number of tests that cover the same code and
could be replaced with just a single test.

In an ideal world we would even determine the minimal set of tests that
covers our codebase and remove the excessive ones.
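
A minimal sketch of that "minimal covering set" idea as a greedy set cover over
per-test coverage data (illustrative only; no such tooling is implied to exist):

{noformat}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: greedily selects a small subset of tests that keeps the same covered code. */
final class TestMinimizationSketch {
    /** @param coverage Per-test set of covered code units (classes, methods, branches, ...). */
    static List<String> minimalCoveringSet(Map<String, Set<String>> coverage) {
        Set<String> uncovered = new HashSet<>();
        coverage.values().forEach(uncovered::addAll);

        List<String> picked = new ArrayList<>();

        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;

            // Pick the test that covers the most still-uncovered units.
            for (Map.Entry<String, Set<String>> e : coverage.entrySet()) {
                Set<String> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);

                if (gain.size() > bestGain) {
                    best = e.getKey();
                    bestGain = gain.size();
                }
            }

            picked.add(best);
            uncovered.removeAll(coverage.get(best));
        }

        return picked; // Tests outside this list add no extra coverage (by this metric alone).
    }
}
{noformat}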

--
Best regards,
Sergey Chugunov.

On Thu, Nov 15, 2018 at 2:34 PM Dmitriy Pavlov  wrote:

> Hi Igniters,
>
>
>
> Some of us have started to use the Bot to get approval of PRs. It helps to
> protect master from new failures, but this requires running the RunAll test set
> for each commit, and this puts noticeable pressure on the TC infrastructure.
>
>
>
> I would like to ask you to share your ideas on how to make runAll faster,
> maybe you can share any of your measurements and any other info about
> (possible) bottlenecks.
>
>
>
> Sincerely,
>
> Dmitriy Pavlov
>


[jira] [Created] (IGNITE-10153) [TC Bot] Implement tests running time report

2018-11-07 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-10153:


 Summary: [TC Bot] Implement tests running time report
 Key: IGNITE-10153
 URL: https://issues.apache.org/jira/browse/IGNITE-10153
 Project: Ignite
  Issue Type: Task
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov


In order to optimize the running time of the existing test base (at the moment all 
tests require ~50 hours of running time on the available agents), we need a report 
page with info about each suite.

At the first stage the page will show, for each suite, the tests running longer than 
1 minute in the latest run on the master branch.

Later, other features may be added, such as analyzing PRs or other branches and 
adjusting the running time limit (e.g. all tests longer than 30 seconds, and so on).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] ignite pull request #5047: IGNITE-9957 Updates count was reduced to speed up...

2018-10-22 Thread sergey-chugunov-1985
GitHub user sergey-chugunov-1985 opened a pull request:

https://github.com/apache/ignite/pull/5047

IGNITE-9957 Updates count was reduced to speed up the test



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gridgain/apache-ignite ignite-9957

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/5047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5047


commit 9fb69711329e6651c32c04c1df17f9709ddaec66
Author: Sergey Chugunov 
Date:   2018-10-22T11:25:28Z

IGNITE-9957 Updates count was reduced to speed up the test




---


[jira] [Created] (IGNITE-9958) Optimize execution time of CacheContinuousQueryVariationsTest

2018-10-22 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-9958:
---

 Summary: Optimize execution time of 
CacheContinuousQueryVariationsTest
 Key: IGNITE-9958
 URL: https://issues.apache.org/jira/browse/IGNITE-9958
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
 Fix For: 2.8


Tests from CacheContinuousQueryVariationsTest require a lot of time ([sample 
run|https://ci.ignite.apache.org/viewLog.html?buildId=2136245=IgniteTests24Java8_RunAll=testsInfo])
 and thus slow down the build on TeamCity significantly.

They need to be investigated and optimized if possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9957) Optimize execution time of BinaryMetadataUpdatesFlowTest

2018-10-22 Thread Sergey Chugunov (JIRA)
Sergey Chugunov created IGNITE-9957:
---

 Summary: Optimize execution time of BinaryMetadataUpdatesFlowTest
 Key: IGNITE-9957
 URL: https://issues.apache.org/jira/browse/IGNITE-9957
 Project: Ignite
  Issue Type: Improvement
Reporter: Sergey Chugunov
 Fix For: 2.8


As TC statistics shows ([example run on master 
branch|https://ci.ignite.apache.org/viewLog.html?currentGroup=test=org.apache.ignite.testsuites.IgniteBinaryObjectsTestSuite%23teamcity%23org.apache.ignite.internal.processors.cache.binary%23teamcity%23BinaryMetadataUpdatesFlowTest=1=DURATION_DESC=20===IgniteTests24Java8_RunAll=2136245=testsInfo]),
 three tests within this class require about 6 minutes of running time which is 
a lot.

It is worth investigating these tests and speeding them up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

