Node and cluster life-cycle in ignite-3
Hello Igniters,

I would like to start a discussion on evolving IEP-73 [1]. It currently covers the narrow topic of component dependencies, but it makes sense for the IEP to cover a broader question: how different components should be initialized to support different modes of an individual node or a whole cluster. There is an idea to borrow the notion of run-levels from Unix-like systems, and I suggest the following design to implement it.

1. To start and function at a specific run-level, a node needs to start and initialize components in the proper order. During initialization, components may need to notify each other about reaching a particular run-level so that other components are able to execute their actions. Orchestrating this process should be the responsibility of a new component.

2. The orchestration component doesn't manage the initialization process directly but uses another abstraction called a scenario. Examples of run-levels in the context of Ignite 2.x are Maintenance Mode and the INACTIVE-READONLY-ACTIVE states of a cluster; each level is reached when the corresponding scenario has executed. So the responsibility of the orchestrator is managing scenarios and providing them with the infrastructure for spreading notification events between components. All low-level details and knowledge of existing components and their dependencies are encapsulated inside scenarios.

3. Scenarios allow nesting, e.g. a scenario for the INACTIVE cluster state can be "upgraded" to the READONLY state by executing the diff between the INACTIVE and READONLY scenarios.

I see several advantages of this design compared to the existing model in Ignite 2.x (mostly implemented in IgniteKernal and based on two main methods: start and onKernalStart):

1. A more flexible model allows implementing more diverse run-levels for different needs (the already mentioned Maintenance Mode, cluster state modes like ACTIVE-INACTIVE, and smart strategies for cache warm-up on node start).

2. Knowledge of components and their dependencies is encapsulated inside scenarios, which makes it easier to create new scenarios.

Open questions:

1. As I see it right now, it is hard to standardize the initialization events components notify each other with.

2. It is not clear whether run-levels should be organized into one rigid hierarchy (where the first run-level always precedes the second and so on) or whether they should be more independent.

What do you think?

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-73%3A+Node+startup
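To make the idea more concrete, here is a minimal Java sketch of what the orchestrator/scenario split could look like. All names (RunLevel, EventBus, Scenario, Orchestrator) are hypothetical illustrations of the design above, not a proposed API:

    import java.util.concurrent.CompletableFuture;
    import java.util.function.Consumer;

    /** Hypothetical run-levels; names mirror the Ignite 2.x examples above. */
    enum RunLevel { MAINTENANCE, INACTIVE, READONLY, ACTIVE }

    /** Infrastructure the orchestrator provides for spreading notifications between components. */
    interface EventBus {
        void publish(String event);
        void subscribe(String event, Consumer<String> handler);
    }

    /** Encapsulates the component init order and dependencies needed to reach one run-level. */
    interface Scenario {
        RunLevel targetLevel();
        CompletableFuture<Void> execute(EventBus bus);
    }

    /** Manages scenarios and knows nothing about concrete components. */
    final class Orchestrator {
        private final EventBus bus;

        Orchestrator(EventBus bus) { this.bus = bus; }

        /** Reaches the target level, e.g. by running the diff scenario INACTIVE -> READONLY. */
        CompletableFuture<Void> moveTo(Scenario scenario) {
            return scenario.execute(bus);
        }
    }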
Re: Fix force rebuild indexes
Kirill, Indeed, the current behavior of the force rebuild API seems broken; we need to fix it, +1 from me too. BTW, would it be useful to allow rebuilding individual indices? On Wed, Mar 24, 2021 at 6:20 PM ткаленко кирилл wrote: > Hello! > > What do you mean by the implementation plan? > Implement ticket https://issues.apache.org/jira/browse/IGNITE-14321 > > 24.03.2021, 17:17, "Maxim Muzafarov" : > > Hello, > > > > I think the issue definitely must be fixed, so +1 from my side. > > BTW, what would be your implementation plan? > > > > I think the [1] issue may be interesting for you. > > > > [1] https://issues.apache.org/jira/browse/IGNITE-13056 > > > > On Tue, 23 Mar 2021 at 21:04, ткаленко кирилл > wrote: > >> Hello everyone! > >> > >> I found that a forced rebuild of indexes does not work correctly. If > the indexes were rebuilt once, then nothing will happen each time a forced > rebuild is attempted. Also, if during the first rebuild of indexes (before > the checkpoint) we call a forced rebuild of indexes, then we will execute > it sequentially after the first. It seems that we need to fix this. > >> > >> I suggest not to allow (throw an exception) to start a forced rebuild > of indexes until the previous one completes. > >> And, of course, fix the ability to launch a forced rebuild of indexes. > >> > >> I want to do this on ticket > https://issues.apache.org/jira/browse/IGNITE-14321. > >> > >> Sorry, the thread was without a subject. > >> http://apache-ignite-developers.2346864.n4.nabble.com/-td51935.html > >> > >> WDYT? >
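For illustration, here is a minimal sketch of the guard behavior Kirill proposes (reject a new forced rebuild while the previous one is still in flight). The class name and message are made up; the real change is tracked in IGNITE-14321:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.atomic.AtomicReference;

    final class IndexRebuildGuard {
        /** Holds the future of the in-flight rebuild, or null when idle. */
        private final AtomicReference<CompletableFuture<Void>> inFlight = new AtomicReference<>();

        /** Starts a forced rebuild or throws if the previous one hasn't completed yet. */
        CompletableFuture<Void> forceRebuild(Runnable rebuildTask) {
            CompletableFuture<Void> fut = new CompletableFuture<>();

            if (!inFlight.compareAndSet(null, fut))
                throw new IllegalStateException("Forced index rebuild is already in progress.");

            CompletableFuture.runAsync(rebuildTask).whenComplete((res, err) -> {
                inFlight.set(null);

                if (err != null)
                    fut.completeExceptionally(err);
                else
                    fut.complete(null);
            });

            return fut;
        }
    }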
[jira] [Created] (IGNITE-14382) Network module API structuring
Sergey Chugunov created IGNITE-14382: Summary: Network module API structuring Key: IGNITE-14382 URL: https://issues.apache.org/jira/browse/IGNITE-14382 Project: Ignite Issue Type: Sub-task Components: networking Reporter: Sergey Chugunov Fix For: 3.0.0-alpha2 The first version of the network module introduced a NetworkCluster interface providing access to all functionality of the module: sending and receiving messages in p2p fashion, topology API (current set of online nodes, node join and leave events) and some lifecycle-related methods. Further development has shown that it makes sense to gather these pieces of functionality under separate interfaces accessible from NetworkCluster or a similar single entry point. Suggestions for naming of these interfaces: *Topology* and *Messaging*. Keeping lifecycle callbacks and methods under the same interface seems natural at the moment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
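A rough sketch of the proposed structuring; only the interface names Topology and Messaging come from the ticket, the method signatures are assumptions made up for illustration:

    import java.util.Collection;
    import java.util.function.Consumer;

    /** Topology API: current set of online nodes plus join/leave events. */
    interface Topology {
        Collection<String> onlineNodes();
        void onNodeJoined(Consumer<String> listener);
        void onNodeLeft(Consumer<String> listener);
    }

    /** P2p message sending and receiving. */
    interface Messaging {
        void send(String nodeId, Object message);
        void onMessage(Consumer<Object> handler);
    }

    /** Single entry point; lifecycle methods stay here for now. */
    interface NetworkCluster {
        Topology topology();
        Messaging messaging();
        void start();
        void stop();
    }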
[jira] [Created] (IGNITE-14323) Messaging naming unification
Sergey Chugunov created IGNITE-14323: Summary: Messaging naming unification Key: IGNITE-14323 URL: https://issues.apache.org/jira/browse/IGNITE-14323 Project: Ignite Issue Type: Sub-task Components: networking Reporter: Sergey Chugunov Fix For: 3.0.0-alpha2 Naming of methods for message sending in the NetworkCluster interface could be unified. # *send* method returning a CompletableFuture, with the semantics "send a message and complete when the remote node replies with a result". # *sendNoAck* method returning void, with the semantics "send a message to the remote node and return immediately once the message is written to the output connection". -- This message was sent by Atlassian Jira (v8.3.4#803005)
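The two signatures could look like this (shown on a standalone interface with placeholder argument types; only the method names and semantics come from the ticket):

    import java.util.concurrent.CompletableFuture;

    interface NetworkMessaging {
        /** Sends a message; the future completes when the remote node replies with a result. */
        CompletableFuture<Object> send(String nodeId, Object message);

        /** Returns as soon as the message is written to the output connection; no reply awaited. */
        void sendNoAck(String nodeId, Object message);
    }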
[jira] [Created] (IGNITE-14297) API to unregister HandlerProvider from network module
Sergey Chugunov created IGNITE-14297: Summary: API to unregister HandlerProvider from network module Key: IGNITE-14297 URL: https://issues.apache.org/jira/browse/IGNITE-14297 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Fix For: 3.0.0-alpha2 At the moment client components can register HandlerProviders in the network component but cannot unregister them. However, unregistration is important in the component lifecycle to properly stop a component. An API to unregister a handler from the network should be implemented, with a clear contract about possible races (one thread unregisters a component's handler while another thread sends a message from the same component). -- This message was sent by Atlassian Jira (v8.3.4#803005)
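One possible shape for such a contract: unregistration returns a future that completes only when in-flight deliveries to the handler have drained, which pins down the race described above. All names here are illustrative:

    import java.util.concurrent.CompletableFuture;

    interface HandlerRegistry {
        /** Registers a provider and returns a handle for deterministic unregistration. */
        Registration register(Object handlerProvider);

        interface Registration {
            /**
             * Unregisters the handler. The future completes when no thread can deliver
             * another message to it, so the owning component may then stop safely.
             */
            CompletableFuture<Void> unregister();
        }
    }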
[jira] [Created] (IGNITE-14296) Classe
Sergey Chugunov created IGNITE-14296: Summary: Classe Key: IGNITE-14296 URL: https://issues.apache.org/jira/browse/IGNITE-14296 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Classes' names in the network module are self-explanatory and don't need a special prefix; it can be removed to make the code more compact. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14295) Message interface to be introduced
Sergey Chugunov created IGNITE-14295: Summary: Message interface to be introduced Key: IGNITE-14295 URL: https://issues.apache.org/jira/browse/IGNITE-14295 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Network module should introduce a public Message interface to handle messages to send and receive. This interface should provide at least information about the message type (and possibly version) to enable effective serialization/deserialization and the ability to subscribe to messages of a certain type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
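A minimal sketch of such a contract; the field set is a guess from the description above:

    import java.util.function.Consumer;

    /** Public message contract: type drives serialization and subscription, version allows evolution. */
    interface Message {
        short type();
        byte version();
    }

    /** Example of the type-based subscription the ticket mentions. */
    interface MessageService {
        void subscribe(short messageType, Consumer<Message> handler);
    }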
[jira] [Created] (IGNITE-14231) IGNITE_ENABLE_FORCIBLE_NODE_KILL flag is not supported in inverse connection request scenario
Sergey Chugunov created IGNITE-14231: Summary: IGNITE_ENABLE_FORCIBLE_NODE_KILL flag is not supported in inverse connection request scenario Key: IGNITE-14231 URL: https://issues.apache.org/jira/browse/IGNITE-14231 Project: Ignite Issue Type: Bug Affects Versions: 2.9.1 Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.11 The IGNITE_ENABLE_FORCIBLE_NODE_KILL flag enables server nodes to forcibly kill clients visible via Discovery but unreachable by the Communication protocol. In the inverse connection request scenario this leads to an infinite loop: the server tries to establish a communication connection to the unreachable client, fails, and tries again, effectively ignoring the flag. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14184) API for off-line update of configuration
Sergey Chugunov created IGNITE-14184: Summary: API for off-line update of configuration Key: IGNITE-14184 URL: https://issues.apache.org/jira/browse/IGNITE-14184 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Tools like the new CLI may include the ability to view/change an existing configuration without starting Ignite nodes. This may also be useful in Ignite version upgrade scenarios. The configuration module should support this case with all validations and other functionality. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14183) Cross-root validation
Sergey Chugunov created IGNITE-14183: Summary: Cross-root validation Key: IGNITE-14183 URL: https://issues.apache.org/jira/browse/IGNITE-14183 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Current validation works only inside one configuration root, but it is possible that properties from one root depend on properties from another. Cross-root validation should be implemented to take these cases into account. -- This message was sent by Atlassian Jira (v8.3.4#803005)
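As an illustration (all names assumed), a cross-root validator could receive a read-only snapshot of all roots instead of just the one being changed:

    import java.util.List;
    import java.util.Map;

    interface CrossRootValidator {
        /** Validates a change to one root against a snapshot of all roots; returns found issues. */
        List<String> validate(String changedRoot, Map<String, Object> allRootsSnapshot);
    }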
[jira] [Created] (IGNITE-14182) NamedList remove improvements
Sergey Chugunov created IGNITE-14182: Summary: NamedList remove improvements Key: IGNITE-14182 URL: https://issues.apache.org/jira/browse/IGNITE-14182 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov From the API perspective, removing an element from a NamedList means nullifying that particular element in the list. On the Storage level it turns into removing all keys sitting under this particular element. The configuration engine should be responsible for cleaning up all necessary keys from Storage. Notifications should be aware of removals from NamedLists as well (e.g. one notification about removing the element from the NamedList instead of a bunch of notifications about each of the element's fields). -- This message was sent by Atlassian Jira (v8.3.4#803005)
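Schematically, the desired collapse of notifications might look like this (the listener name and shape are made up):

    /** One event per removed NamedList element instead of one per underlying key/field. */
    interface NamedListListener<T> {
        void onElementRemoved(String name);
        void onElementUpdated(String name, T newView);
    }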
[jira] [Created] (IGNITE-14181) Configuration to support arrays of primitive types
Sergey Chugunov created IGNITE-14181: Summary: Configuration to support arrays of primitive types Key: IGNITE-14181 URL: https://issues.apache.org/jira/browse/IGNITE-14181 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Configuration should support declaring arrays of primitive types (e.g. arrays of addresses in IpFinder). Only primitive types are needed; for user types, NamedLists should be used instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
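For example, a schema field like the one below should become expressible. The @Config/@Value annotations follow the IEP-55 style, but treat this fragment as an assumption (imports omitted), not the final syntax:

    /** Hypothetical schema fragment: an IpFinder configured with an array of addresses. */
    @Config
    public class IpFinderConfigurationSchema {
        @Value
        public String[] addresses;  // array of a simple type, not a NamedList of user types
    }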
[jira] [Created] (IGNITE-14180) Storage notification API
Sergey Chugunov created IGNITE-14180: Summary: Storage notification API Key: IGNITE-14180 URL: https://issues.apache.org/jira/browse/IGNITE-14180 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Local (and in the future global) Storage should support a notification mechanism: all interested components should be able to subscribe to notifications about stored keys (add, remove, update). -- This message was sent by Atlassian Jira (v8.3.4#803005)
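A sketch of what the subscription could look like (names and the byte[] value type are assumptions):

    /** Callbacks for the three kinds of key changes mentioned above. */
    interface StorageListener {
        void onAdded(String key, byte[] value);
        void onUpdated(String key, byte[] oldValue, byte[] newValue);
        void onRemoved(String key);
    }

    interface Storage {
        /** Subscribes to changes of keys under the given prefix; close the handle to unsubscribe. */
        AutoCloseable subscribe(String keyPrefix, StorageListener listener);
    }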
[jira] [Created] (IGNITE-14178) Asynchronous Storage API
Sergey Chugunov created IGNITE-14178: Summary: Asynchronous Storage API Key: IGNITE-14178 URL: https://issues.apache.org/jira/browse/IGNITE-14178 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14155) Test IgniteClusterIdTagTest.testInMemoryClusterTag is flaky on TC
Sergey Chugunov created IGNITE-14155: Summary: Test IgniteClusterIdTagTest.testInMemoryClusterTag is flaky on TC Key: IGNITE-14155 URL: https://issues.apache.org/jira/browse/IGNITE-14155 Project: Ignite Issue Type: Test Reporter: Sergey Chugunov Assignee: Sergey Chugunov History of the test is available [here|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=2444565365384645281=testDetails_IgniteTests24Java8=%3Cdefault%3E]. This test is flaky but the problem is in the test itself as it synchronously asserts a condition that is intrinsically asynchronous. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSSION] Unified Configuration for Ignite 3.0
Val, Together with Semyon Danilov I did final polishing of code and merged it to the main branch in ignite-3 repo. Code assembles without any issues, tests are green, IgniteRunner starts and serves REST requests successfully. On Tue, Dec 15, 2020 at 10:37 PM Valentin Kulichenko < valentin.kuliche...@gmail.com> wrote: > Thanks, Sergey! Looks good to me. > > -Val > > On Tue, Dec 15, 2020 at 12:12 AM Sergey Chugunov < > sergey.chugu...@gmail.com> > wrote: > > > Val, > > > > Your comments make total sense to me, I've fixed them and updated pull > > request. Please take a look at my code when you have time. > > > > I also added a port range configuration to enable starting multiple > > instances of ignite without specifying port manually for each instance. > > > > -- > > Best Regards, > > Sergey Chugunov > > > > On Sat, Dec 12, 2020 at 3:20 AM Valentin Kulichenko < > > valentin.kuliche...@gmail.com> wrote: > > > > > Hi Sergey, > > > > > > Thanks for doing this. > > > > > > It looks like PR #5 is already under review, so I guess it will be > merged > > > soon. I would really love to see that, because the configuration > > framework > > > is one of the foundational components - we need it to continue building > > > Ignite 3.0. > > > > > > As for PR #6, it looks a little raw, but I believe we need it to > connect > > > the configuration framework with the CLI tool that is also pending for > > the > > > merge, is this correct? If that's the case, I think it's OK to merge > this > > > code as a separate module, with an understanding that it will change > > > significantly down the road. I would do a couple of changes though: > > > > > >1. Get rid of "simplistic-ignite" naming, as it's a little > confusing. > > >Even though it's more of a prototype at this point, it should be > clear > > > what > > >the module is responsible for. Can we rename it to "ignite-runner" > or > > >something along those lines? > > >2. Update the output - I don't like that it prints out the > > >Javalin's banner and messages. I suggest replacing this with some > very > > >basic Ignite logging: an entry showing the version of Ignite; an > entry > > >indicating that the REST protocol is enabled on a certain port; an > > entry > > >that the process is successfully started. This is just to make sure > > that > > >anyone who plays with it understands what's going on. > > > > > > Any objections? > > > > > > -Val > > > > > > On Fri, Dec 11, 2020 at 9:53 AM Sergey Chugunov < > > sergey.chugu...@gmail.com > > > > > > > wrote: > > > > > > > Hello Igniters, > > > > > > > > I would like to present two pull requests [1], [2] with basic > > > > implementation of IEP-55 for Unified Configuration [3] and IEP-63 > REST > > > API > > > > for Unified Configuration [4]. > > > > > > > > The main goal of these PRs is to present and discuss a new approach > for > > > > preparing and managing Ignite configuration in a more robust and > > > convenient > > > > way than it was before. > > > > > > > > These PRs cover basic aspects of configuration but other steps for > > > > developing functionality are already defined; ticket IGNITE-13511 [5] > > > > summarizes work to do. > > > > > > > > In a nutshell proposed approach to configuration is as follows: > > > > > > > > We want to declare configuration with POJO-based schemas that are > > concise > > > > and contain all important information about validation and how > > different > > > > pieces of configuration relate to each other. 
> > > > When schemas are marked with annotations annotation processor enters > > the > > > > game and generates most of boilerplate code thus freeing users from > > > writing > > > > it by hand. > > > > > > > > REST API module from [2] contains an example of managing > configuration > > > and > > > > exposing it to external tools like a Unified CLI tool presented in > [6]. > > > > > > > > [1] https://github.com/apache/ignite-3/pull/5 > > > > [2] https://github.com/apache/ignite-3/pull/6 > > > > [3] > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration > > > > [4] > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-63%3A+REST+API+module+to+integrate+new+modular+architecture+and+management > > > > [5] https://issues.apache.org/jira/browse/IGNITE-13511 > > > > [6] > > > > > > > > > > > > > > http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Unified-CLI-tool-td50618.html > > > > > > > > > >
Re: [DISCUSSION] Unified Configuration for Ignite 3.0
Val, Your comments make total sense to me, I've fixed them and updated pull request. Please take a look at my code when you have time. I also added a port range configuration to enable starting multiple instances of ignite without specifying port manually for each instance. -- Best Regards, Sergey Chugunov On Sat, Dec 12, 2020 at 3:20 AM Valentin Kulichenko < valentin.kuliche...@gmail.com> wrote: > Hi Sergey, > > Thanks for doing this. > > It looks like PR #5 is already under review, so I guess it will be merged > soon. I would really love to see that, because the configuration framework > is one of the foundational components - we need it to continue building > Ignite 3.0. > > As for PR #6, it looks a little raw, but I believe we need it to connect > the configuration framework with the CLI tool that is also pending for the > merge, is this correct? If that's the case, I think it's OK to merge this > code as a separate module, with an understanding that it will change > significantly down the road. I would do a couple of changes though: > >1. Get rid of "simplistic-ignite" naming, as it's a little confusing. >Even though it's more of a prototype at this point, it should be clear > what >the module is responsible for. Can we rename it to "ignite-runner" or >something along those lines? >2. Update the output - I don't like that it prints out the >Javalin's banner and messages. I suggest replacing this with some very >basic Ignite logging: an entry showing the version of Ignite; an entry >indicating that the REST protocol is enabled on a certain port; an entry >that the process is successfully started. This is just to make sure that >anyone who plays with it understands what's going on. > > Any objections? > > -Val > > On Fri, Dec 11, 2020 at 9:53 AM Sergey Chugunov > > wrote: > > > Hello Igniters, > > > > I would like to present two pull requests [1], [2] with basic > > implementation of IEP-55 for Unified Configuration [3] and IEP-63 REST > API > > for Unified Configuration [4]. > > > > The main goal of these PRs is to present and discuss a new approach for > > preparing and managing Ignite configuration in a more robust and > convenient > > way than it was before. > > > > These PRs cover basic aspects of configuration but other steps for > > developing functionality are already defined; ticket IGNITE-13511 [5] > > summarizes work to do. > > > > In a nutshell proposed approach to configuration is as follows: > > > > We want to declare configuration with POJO-based schemas that are concise > > and contain all important information about validation and how different > > pieces of configuration relate to each other. > > When schemas are marked with annotations annotation processor enters the > > game and generates most of boilerplate code thus freeing users from > writing > > it by hand. > > > > REST API module from [2] contains an example of managing configuration > and > > exposing it to external tools like a Unified CLI tool presented in [6]. > > > > [1] https://github.com/apache/ignite-3/pull/5 > > [2] https://github.com/apache/ignite-3/pull/6 > > [3] > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration > > [4] > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-63%3A+REST+API+module+to+integrate+new+modular+architecture+and+management > > [5] https://issues.apache.org/jira/browse/IGNITE-13511 > > [6] > > > > > http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Unified-CLI-tool-td50618.html > > >
[DISCUSSION] Unified Configuration for Ignite 3.0
Hello Igniters,

I would like to present two pull requests [1], [2] with a basic implementation of IEP-55 Unified Configuration [3] and IEP-63 REST API for Unified Configuration [4].

The main goal of these PRs is to present and discuss a new approach for preparing and managing Ignite configuration in a more robust and convenient way than before.

These PRs cover basic aspects of configuration, but other steps for developing the functionality are already defined; ticket IGNITE-13511 [5] summarizes the work to do.

In a nutshell, the proposed approach to configuration is as follows: we want to declare configuration with POJO-based schemas that are concise and contain all important information about validation and about how different pieces of configuration relate to each other. When schemas are marked with annotations, an annotation processor enters the game and generates most of the boilerplate code, thus freeing users from writing it by hand.

The REST API module from [2] contains an example of managing configuration and exposing it to external tools like the Unified CLI tool presented in [6].

[1] https://github.com/apache/ignite-3/pull/5
[2] https://github.com/apache/ignite-3/pull/6
[3] https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration
[4] https://cwiki.apache.org/confluence/display/IGNITE/IEP-63%3A+REST+API+module+to+integrate+new+modular+architecture+and+management
[5] https://issues.apache.org/jira/browse/IGNITE-13511
[6] http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Unified-CLI-tool-td50618.html
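To give a flavor of the approach, below is a minimal schema in the IEP-55 style. The annotation names mirror the ones used in the PRs, but treat the exact names, signatures and imports (omitted here) as illustrative - the PRs [1], [2] are the source of truth:

    /** Plain POJO schema; the annotation processor generates views, change and validation code. */
    @ConfigurationRoot(rootName = "network")
    public class NetworkConfigurationSchema {
        /** Validation is declared right on the schema field. */
        @Min(1024)
        @Value
        public int port;

        /** Range of ports to try, so several nodes can start without manual port tuning. */
        @Value
        public int portRange;
    }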
[jira] [Created] (IGNITE-13718) REST API to manage configuration
Sergey Chugunov created IGNITE-13718: Summary: REST API to manage configuration Key: IGNITE-13718 URL: https://issues.apache.org/jira/browse/IGNITE-13718 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Application developed in IGNITE-13712 should expose REST API for managing configuration and integrate with command-line tool prototype from IGNITE-13610. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13712) Simple application integrating dynamic configuration
Sergey Chugunov created IGNITE-13712: Summary: Simple application integrating dynamic configuration Key: IGNITE-13712 URL: https://issues.apache.org/jira/browse/IGNITE-13712 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Assignee: Sergey Chugunov Apache Ignite node and cluster configurations include many use-cases of varying complexity. To explore how different use-cases work based on new dynamic configuration a sample application needs to be developed. The application should support basic and more complicated configurations, exact list of configurations will be provided later. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Ignite 3.0 development approach
Igniters, I agree that create or not create is not a question, rephrasing Shakespeare. My main point is that developing new features on top of old 2.x-style architecture is a bad idea. We will write the code and spend some time stabilizing it (which is expected and fine). But then, when we finally decide to fix our architecture and pay our (already huge) technical debt, we will have to rewrite this code again and spend time stabilizing it again. Creating new components on top of 2.x (which is actually 1.x, nothing fundamentally new was introduced in terms of architecture) is equal to wasting time now and creating more worthless work for the future. Earlier I suggested to rank all new features according to their criticality and amount of breaking changes and shape 3.0 scope based on this analysis. Let's get back to this idea and prepare a scope based on publicly shared arguments. One more thing I would add here. Our users are smart people and make decisions about upgrading or not upgrading to a new version based on cost/value balance. Incremental approach keeps cost (public API breaking changes) high but brings questionable amounts of value with each iteration. If we add more valuable features to 3.0 and force users to pay the cost only once they will be happier than if we split really needed changes to several major releases and send our users to hell of endless rewriting their codebases. In the latter case we'll see users to be much more reluctant to upgrade to newer versions. Hope this makes sense. On Mon, Nov 16, 2020 at 2:24 PM Nikolay Izhikov wrote: > > Let's indeed focus on Sergey's suggestions on the design->development > approach. > > +1 > > > - API & configuration cleanup > > - New management tool > > - Schema-first approach > > - New replication infrastructure > > +1. > > > 16 нояб. 2020 г., в 13:40, Alexey Goncharuk > написал(а): > > > > Folks, > > > > I think we are overly driven away by the phrase 'new repo' rather than > the > > essence of my suggestion. We can keep developing in the same repo, we can > > even keep developing in the master branch. My point is that Ignite 3.0 > is a > > chance to move on with the architecture, so if we really want to make > > architectural improvements, we should not strive for incremental changes > > for *some parts of the code*. > > > > Maxim, > > > > To comment on your examples: I think that the huge effort that is > currently > > required to make any significant change in Ignite is the perfect example > of > > how we lack structure in the codebase. Yes, theoretically we can > introduce > > incremental changes in the code that will improve the structure, but my > > question is: we did not do it before, what will enforce us to make these > > changes now? With the current approach, adding a new feature increases > the > > test time non-linearly because without proper decoupling you have to test > > all possible combinations of features together. We can move faster than > > that. > > > > I also do not agree that we should reduce the scope of Ignite 3.0 that > > much. I do not see how the schema-first approach can be properly and > > reliably implemented without a reliable HA metastorage, which in turn > > requires a reliable replication protocol to be implemented. Besides, if a > > number of people want to work on some Ignite feature, why should they > wait > > because not all community members have time to review the changes? > > > > Let's indeed focus on Sergey's suggestions on the design->development > > approach. 
I back both Nikolay's and Maxim's scope, but I think we should > > unite them, not intersect, and the minimal list of changes to be included > > to Ignite 3.0 is: > > > > - API & configuration cleanup > > - New management tool > > - Schema-first approach > > - New replication infrastructure > > > > Any smaller subset of changes will leave Ignite 3.0 in a transient state > > with people being too afraid to move to it because there are more major > > breaking changes scheduled. > > > > пт, 13 нояб. 2020 г. в 18:28, Alexey Zinoviev : > > > >> I'm -1 for creating a new repo. > >> Also I support Maxim's plan for 3.0 > >> > >> пт, 13 нояб. 2020 г. в 15:50, Maxim Muzafarov : > >> > >>> Val, > >>> > >>> > >>> Why *creating a new repo* is the main point we faced with? Would it be > >>> better to discuss the components design approach and scope management > >>> first suggested by Sergey Chugunov? I doubt that new repo will solve > >>> move us fo
Re: [DISCUSS] Ignite 3.0 development approach
Igniters, I thought over the Friday meeting ideas and concerns and summarized them in these three points:

1. *Components design unification approach.* New proposed components will be developed by different contributors, but they need to be unified and should integrate with each other easily. To ensure that, I suggest forming an architecture group that will create design guidelines for all components and a high-level overview of the overall architecture. How code is split into components, what the component boundaries are, how the component lifecycle works and what its interfaces are - all these and other questions should be covered.

2. *Scope management.* Apache Ignite 3.0 should be implemented within a reasonable time, so we need some procedure to decide whether a particular feature should be dropped from the scope of 3.0 and postponed to the 3.1 release. To do so I suggest ranking all features by two parameters: criticality for 3.0 and amount of breaking changes. The 3.0 scope should include features of high criticality AND features with a big amount of breaking changes. All other features can be made optional.

3. *Development transparency.* Development of all components should be made as transparent for everyone as possible. Any contributor should be able to look over any component at any stage of development. To achieve this I suggest creating a separate public repository dedicated to 3.0 development. It will make the code available for everyone, and when development of 3.0 is done we won't lose any stars of our current repository as we merge the dev repo into the main one and drop the dev one.

Do these ideas make sense to you? Are there any concerns not covered by these suggestions?

On Fri, Nov 6, 2020 at 7:36 PM Kseniya Romanova wrote: > Here are the slides from Alexey Goncharuk. Let's think this over and > continue on Monday: > > https://go.gridgain.com/rs/491-TWR-806/images/Ignite_3_Plans_and_development_process.pdf > > On Thu, Nov 5, 2020 at 11:13 Anton Vinogradov wrote: > > > Folks, > > > > Should we perform cleanup work before (r)evolutional changes? > > My huge proposal is to get rid of things which we don't need anyway > > - local caches, > > - strange tx modes, > > - code overcomplexity because of RollingUpgrade feature never attended at > > AI, > > - etc, > > before choosing the way. > > > > On Tue, Nov 3, 2020 at 3:31 PM Valentin Kulichenko < > > valentin.kuliche...@gmail.com> wrote: > > > > > Ksenia, thanks for scheduling this on such short notice! > > > > > > As for the original topic, I do support Alexey's idea. We're not going > to > > > rewrite anything from scratch, as most of the components are going to > be > > > moved as-is or with minimal modifications. However, the changes that > are > > > proposed imply serious rework of the core parts of the code, which are > > not > > > properly decoupled from each other and from other parts. This makes the > > > incremental approach borderline impossible. Developing in a new repo, > > > however, addresses this concern. As a bonus, we can also refactor the > > code, > > > introduce better decoupling, get rid of kernel context, and develop > unit > > > tests (finally!). > > > > > > Basically, this proposal only affects the *process*, not the set of > > changes > > > we had discussed before. Ignite 3.0 is our unique chance to make things > > > right.
> > > > > > -Val > > > > > > On Tue, Nov 3, 2020 at 3:06 AM Kseniya Romanova < > > romanova.ks@gmail.com > > > > > > > wrote: > > > > > > > Pavel, all the interesting points will be anyway published here in > > > English > > > > (as the principal "if it's not on devlist it doesn't happened" is > still > > > > relevant). This is just a quick call for a group of developers. Later > > we > > > > can do a separate presentation of idea and discussion in English as > we > > > did > > > > for the Ignite 3.0 draft of changes. > > > > > > > > вт, 3 нояб. 2020 г. в 13:52, Pavel Tupitsyn : > > > > > > > > > Kseniya, > > > > > > > > > > Thanks for scheduling this call. > > > > > Do you think we can switch to English if non-Russian speaking > > community > > > > > members decide to join? > > > > > > > > > > On Tue, Nov 3, 2020 at 1:32 PM Kseniya Romanova < > > > > romanova.ks@gmail.com > > > > > > > > > > > wrote: > > > > > > > > > > > Let's do this community discussion open. Here's the link on zoom > > call > > > > in > > > > > > Russian for Friday 6 PM: > > > > > > > > https://www.meetup.com/Moscow-Apache-Ignite-Meetup/events/274360378/ > > > > > > > > > > > > вт, 3 нояб. 2020 г. в 12:49, Nikolay Izhikov < > nizhi...@apache.org > > >: > > > > > > > > > > > > > Time works for me. > > > > > > > > > > > > > > > 3 нояб. 2020 г., в 12:40, Alexey Goncharuk < > > > > > alexey.goncha...@gmail.com > > > > > > > > > > > > > > написал(а): > > > > > > > > > > > > > > > > Nikolay, > > > > > > > > > > > > > > > > I am up for the call. I will try to explain my reasoning in > > > greater > > >
[jira] [Created] (IGNITE-13674) Document Persistent store defragmentation
Sergey Chugunov created IGNITE-13674: Summary: Document Persistent store defragmentation Key: IGNITE-13674 URL: https://issues.apache.org/jira/browse/IGNITE-13674 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov Assignee: Sergey Chugunov -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: IEP-53 Maintenance Mode: request for review
Hi Pavel, Thanks, I looked through your comments and fixed them. Could you please check one more time? On Fri, Oct 9, 2020 at 10:27 AM Pavel Tupitsyn wrote: > Hello Sergey, > > I went over the public API changes briefly and left some minor comments on > GitHub > > Thanks, > Pavel > > On Fri, Oct 9, 2020 at 9:59 AM Sergey Chugunov > wrote: > > > Hello Igniters, > > > > I'm getting closer to finishing main ticket for Maintenance Mode feature > > [1] and now working on test fixes (most likely test modifications are > > needed). > > > > So I would like to ask for a review of my pull request [2] to discuss the > > code earlier. Test status is pretty good so I expect to get a green visa > > soon. > > > > Could you please take a look? > > > > [1] https://issues.apache.org/jira/browse/IGNITE-13366 > > [2] https://github.com/apache/ignite/pull/8325 > > >
Re: Broken test in master: BasicIndexTest
Max, Thanks for spotting this, great catch! Zhenya, could you please file a ticket of at least Critical priority? On Fri, Oct 9, 2020 at 9:24 AM Zhenya Stanilovsky wrote: > > > Thanks Maxim, the test is correct no need for removal. > I checked 2.9 too, but looks it all ok there. I will take a look. > >Hi, Igniters! > > > >I was discovering how indexes work and found a failed test. > >BasicIndexTest#testInlineSizeChange is broken in master and it's not a > >flaky case [1]. But it has been failing since 25/09 only. > > > >I discovered that it happened after the IGNITE-13207 ticket merged > >(Checkpointer code refactoring) [2]. I'm not sure about the expected > >behaviour of the inline index and how checkpointer affects it. But let's > >fix it if it is a bug or completely remove this test. > > > >[1] > > > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=6131871779633595667=%3Cdefault%3E=testDetails > > > >[2] https://issues.apache.org/jira/browse/IGNITE-13207 > > >
IEP-53 Maintenance Mode: request for review
Hello Igniters, I'm getting closer to finishing main ticket for Maintenance Mode feature [1] and now working on test fixes (most likely test modifications are needed). So I would like to ask for a review of my pull request [2] to discuss the code earlier. Test status is pretty good so I expect to get a green visa soon. Could you please take a look? [1] https://issues.apache.org/jira/browse/IGNITE-13366 [2] https://github.com/apache/ignite/pull/8325
[jira] [Created] (IGNITE-13558) GridCacheProcessor should implement better parallelization when restoring partition states on startup
Sergey Chugunov created IGNITE-13558: Summary: GridCacheProcessor should implement better parallelization when restoring partition states on startup Key: IGNITE-13558 URL: https://issues.apache.org/jira/browse/IGNITE-13558 Project: Ignite Issue Type: Improvement Components: persistence Reporter: Sergey Chugunov Fix For: 2.10 The GridCacheProcessor#restorePartitionStates method tries to employ the striped pool to restore partition states in parallel, but the level of parallelization goes down only to one cache group per thread. This is not enough and does not utilize resources effectively when one cache group is much bigger than the others. We need to parallelize the restore process down to individual partitions to get the most from the available resources and speed up node startup. -- This message was sent by Atlassian Jira (v8.3.4#803005)
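Schematically, the proposed change is to submit one task per partition rather than one per cache group (executor wiring simplified, partition restore tasks modeled as Runnable):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;

    final class PartitionStateRestorer {
        /** Restores partitions of all cache groups in parallel, one task per partition. */
        static CompletableFuture<Void> restoreAll(ExecutorService stripedPool,
                                                  List<List<Runnable>> partitionsByGroup) {
            return CompletableFuture.allOf(
                partitionsByGroup.stream()
                    .flatMap(List::stream) // flatten: parallelism is per partition, not per group
                    .map(task -> CompletableFuture.runAsync(task, stripedPool))
                    .toArray(CompletableFuture[]::new));
        }
    }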
[jira] [Created] (IGNITE-13557) Logging improvements for PDS memory restore process
Sergey Chugunov created IGNITE-13557: Summary: Logging improvements for PDS memory restore process Key: IGNITE-13557 URL: https://issues.apache.org/jira/browse/IGNITE-13557 Project: Ignite Issue Type: Task Components: persistence Reporter: Sergey Chugunov Fix For: 2.10 During the partition state restore phase of restoring memory state from disk, Ignite logs a lot of useful information on debug level but very little on info. In many situations more detailed information can be useful for identifying performance issues, but printing info about all partitions is impractical as it produces too many logs. The following improvements are possible though: # To identify any imbalance between partitions and find bigger-than-average partitions, we should gather statistics for each partition during restore (partition size and the time it took to restore it). After restore we'll print information about the average time and the top five partitions that took the most time to restore. # To make restore progress visible, we should periodically print a short message with intermediate progress information. This should kick in when restore starts taking too long (e.g. if restore hasn't finished in 5 minutes, start printing progress each minute). -- This message was sent by Atlassian Jira (v8.3.4#803005)
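A sketch of the statistics from point 1 - collect per-partition restore times and pick the top five slowest (all names illustrative):

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    final class RestoreStats {
        /** Returns the n slowest partitions from a (partitionId -> restore time, ms) map. */
        static List<Map.Entry<Integer, Long>> slowest(Map<Integer, Long> restoreTimesMs, int n) {
            return restoreTimesMs.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toList());
        }
    }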
[jira] [Created] (IGNITE-13550) CLI command to execute maintenance action in corrupted PDS scenario
Sergey Chugunov created IGNITE-13550: Summary: CLI command to execute maintenance action in corrupted PDS scenario Key: IGNITE-13550 URL: https://issues.apache.org/jira/browse/IGNITE-13550 Project: Ignite Issue Type: Task Components: control.sh Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.10 IGNITE-13366 introduces Maintenance Mode for the corrupted PDS scenario and changes the previous behavior of automatically deleting corrupted PDS files. A new command is needed so that the user is able to get information about the maintenance task and trigger the needed action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[DISCUSSION] User-facing API for managing Maintenance Mode
Hello Ignite dev community,

As the internal implementation of Maintenance Mode [1] is getting closer to the finish, I want to discuss one more thing: the user-facing API for managing it (I will use the control utility for examples).

What should be managed? When a node enters MM, it may start some automatic actions (like defragmentation) or wait for a user to intervene and resolve the issue (like in the case of pds corruption). So for manually triggered operations like pds cleanup after corruption we should provide the user with a way to actually trigger the operation. And for long-running automatic operations like defragmentation, actions like status and cancel are reasonable to implement.

At the same time Maintenance Mode is a supporting feature; it doesn't bring any value by itself but enables implementation of other features. Thus putting it at the center of the API and building all commands around a main "maintenance" command may not be right. There are two alternatives - "*Big features deserve their own commands*" and "*Everything should be unified*". Let's consider them.

Big features deserve their own commands

Here for each big feature we implement its own command. Defragmentation is a big separate feature, so why shouldn't it have its own commands to request or cancel it? Examples:

*control.sh defragmentation request-for-node --nodeId [--caches ]* - defragmentation will be started on the particular node after its restart.
*control.sh defragmentation status* - prints information about the status of on-going defragmentation.
*control.sh defragmentation cancel* - cancels on-going defragmentation.

Another command - "maintenance" - will be used for more generic purposes. Examples:

*control.sh maintenance list-records* - prints information about each maintenance record (id and name of the record, parameters, description, current status).
*control.sh maintenance record-actions --id * - prints information about user-triggered actions available for this record (e.g. for a pds corruption record it may be "clean-corrupted-files").
*control.sh maintenance execute-action --id --action-name * - triggers execution of a particular action and prints results.

*Pros:* 1. Big features like defragmentation get their own commands and more freedom in implementing them. 2. It is emphasized that maintenance mode is just a supporting thing and not a first-class feature (it is not at the center of the API).

*Cons:* 1. Duplication of functionality. The same functions may be available via the general maintenance command and a separate command of the feature. 2. Information about a feature may be split into two commands: one piece of information is available in the "feature" command, another in the "maintenance" command.

Everything should be unified

We can go another way and gather all features that rely on MM under one unified command. The API for a node that is already in MM looks complete, logical and very intuitive:

*control.sh maintenance list-records* - outputs all records that have to be resolved to finish maintenance.
*control.sh maintenance record-actions --id * - all actions available for the record.
*control.sh maintenance execute-action --id --action-name * - executes the action of the given name (like the general actions "status" or "delete" and the more specific action "clean-corrupted-files" for the corrupted pds situation).

But the API to request a node to enter maintenance mode becomes more vague:

*control.sh maintenance available-operations* - prints all operations available to request (for instance, defragmentation).
*control.sh maintenance request-operation --id --params * - requests the given operation to start on the next node restart.

Here we have to distinguish operations that are requested automatically (like pds corruption) and not show them to the user.

*Pros:* 1. A single API to get information and trigger actions without any duplication.

*Cons:* 1. We restrict big features by the model provided by the maintenance command. 2. In this API we put maintenance at the center although it is nothing more than a supporting feature. 3. The API to request maintenance operations doesn't feel intuitive to me but rather artificial.

So what do you think? What looks better and more intuitive from your perspective? I will be glad to hear any feedback on the subject. As a result of this discussion I will create a ticket for implementation and include it into IEP-53 [2].

[1] https://issues.apache.org/jira/browse/IGNITE-13366
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
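Whichever CLI variant wins, the node side could back these commands with something like the interfaces below. This is a sketch; only the MaintenanceRecord notion comes from the IEP, the rest is made up:

    import java.util.List;
    import java.util.UUID;

    /** Node-side registry the CLI commands above would talk to over REST. */
    interface MaintenanceRegistry {
        List<MaintenanceRecord> listRecords();                  // backs "list-records"
        List<String> actionsFor(UUID recordId);                 // backs "record-actions"
        String executeAction(UUID recordId, String actionName); // backs "execute-action"
    }

    /** A record of pending maintenance, as described in the Maintenance Mode IEP. */
    interface MaintenanceRecord {
        UUID id();
        String name();
        String parameters();
        String description();
    }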
Re: [DISCUSSION] Maintenance Mode feature
Hello Nikolay, > AFAIKU There is third use-case for this mode. Sorry for the late reply. I took a look at the code and maintenance mode indeed looks a good match for changing master key situation. I want to clarify only one thing. In current implementation we pass new master key name via system property. Do you think of getting rid of this property and passing new master key name to encryption manager with maintenance parameters? In terms of original IEP it is parameters passed with MaintenanceRecord. -- Thanks! On Mon, Sep 21, 2020 at 3:20 PM Nikolay Izhikov wrote: > Hello, Sergey. > > > At the moment I'm aware about two use cases for this feature: corrupted > PDS cleanup and defragmentation. > > AFAIKU There is third use-case for this mode. > > Change encryption master key in case node was down during cluster master > key change. > In this case, node can’t join to the cluster, because it’s master key > differs from the cluster. > To recover node Ignite should locally change master key before join. > > Please, take a look into source code [1] > > [1] > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710 > > > 21 сент. 2020 г., в 14:37, Sergey Chugunov > написал(а): > > > > Ivan, > > > > Sorry for some confusion, MM indeed is not a normal mode. What I was > trying > > to say is that when in MM node still starts and allows the user to > perform > > actions with it like sending commands via control utility/JMX APIs or > > reading metrics. > > > > This is the key point: although the node is not in the cluster but it is > > still alive can be monitored and supports management to do maintenance. > > > > From the code complexity perspective I'm trying to design the feature in > > such a way that all maintenance code is as encapsulated as possible and > > avoids massive interventions into main workflows of components. > > At the moment I'm aware about two use cases for this feature: corrupted > PDS > > cleanup and defragmentation. As far as I know it won't bring too much > > complexity in both cases. > > > > I cannot say for other components but I believe it will be possible to > > integrate MM feature into their workflow as well with reasonable amount > of > > refactoring. > > > > Does it make sense to you? > > > > On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin > wrote: > > > >> Sergey, > >> > >> Thank you for your answer! > >> > >> Might be I am looking at the subject from a different angle. > >> > >>> I think of a node in MM as an almost normal one > >> I cannot think of such a mode as a normal one, because it apparently > >> does not perform usual cluster node functions. It is not a part of a > >> cluster, caches data is not available, Discovery and Communication are > >> not needed. > >> > >> I fear that with "node started in a special mode" approach we will get > >> an additional flag in the code making the code more complex and > >> fragile. Should not I worry about it? > >> > >> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov >: > >>> Vladislav, Ivan, > >>> > >>> Thank you for your questions and suggestions. Let me answer them. > >>> > >>> Vladislav, > >>> > >>> If I understood you correctly, you're talking about a node performing > >> some > >>> automatic actions to fix the problem and then join the cluster as > usual. 
> >>> > >>> However the original ticket [1] where we faced the need for Maintenance > >>> Mode is about exactly the opposite: avoid doing automatic actions and > >> give > >>> a user the ability to decide what to do. > >>> > >>> Also the idea of Maintenance Mode is that the node is able to accept > >>> commands, expose metrics and so on, thus we need all components to be > >>> initialized (some of them may be partially initialized due to their own > >>> maintenance). > >>> To achieve that we need to go through a full cycle of node > initialization > >>> including discovery initialization. When discovery is initialized (in > >>> special isolated mode) I don't think it is easy to switch back to > normal > >>> operations without a restart. > >>> > >>> Ivan, > >>> > >>> I think of a node in MM as an almost normal one (maybe with some > >> components
Re: [DISCUSSION] Maintenance Mode feature
Ivan, If you come up with any ideas that may make this feature better, don't hesitate to share them! Thank you! On Tue, Sep 22, 2020 at 11:27 AM Ivan Pavlukhin wrote: > Sergey, > > Thank you for your answer. While I am not happy with the proposed > approach but things never were easy. Unfortunately I cannot suggest > 100% better approaches so far. So, I should trust your vision. > > 2020-09-22 10:29 GMT+03:00, Sergey Chugunov : > > Ivan, > > > > Checkpointer in Maintenance Mode is started and allows normal operations > as > > it may be needed for defragmentation and possibly other cases. > > > > Discovery is started with a special implementation of SPI that doesn't > make > > attempts to seek and/or connect to the rest of the cluster. From that > > perspective node in MM is totally isolated. > > > > Communication is started as usual but I believe it doesn't matter as > > discovery no other nodes are observed in topology and connection attempt > > should not happen. But it may make sense to implement isolated version of > > communication SPI as well to have 100% guarantee that no communication > with > > other nodes will happen. > > > > It is important to note that GridRestProcessor is started normally as we > > need it to connect to the node via control utility. > > > > On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin > wrote: > > > >> Sergey, > >> > >> > From the code complexity perspective I'm trying to design the feature > >> in such a way that all maintenance code is as encapsulated as possible > >> and > >> avoids massive interventions into main workflows of components. > >> > >> Could please briefly tell what means do you use to achieve > >> encapsulation? Are Discovery, Communication, Checkpointer and other > >> components started in a maintenance mode in current design? > >> > >> 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov : > >> > Hello, Sergey. > >> > > >> >> At the moment I'm aware about two use cases for this feature: > >> >> corrupted > >> >> PDS cleanup and defragmentation. > >> > > >> > AFAIKU There is third use-case for this mode. > >> > > >> > Change encryption master key in case node was down during cluster > >> > master > >> key > >> > change. > >> > In this case, node can’t join to the cluster, because it’s master key > >> > differs from the cluster. > >> > To recover node Ignite should locally change master key before join. > >> > > >> > Please, take a look into source code [1] > >> > > >> > [1] > >> > > >> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710 > >> > > >> >> 21 сент. 2020 г., в 14:37, Sergey Chugunov < > sergey.chugu...@gmail.com> > >> >> написал(а): > >> >> > >> >> Ivan, > >> >> > >> >> Sorry for some confusion, MM indeed is not a normal mode. What I was > >> >> trying > >> >> to say is that when in MM node still starts and allows the user to > >> >> perform > >> >> actions with it like sending commands via control utility/JMX APIs or > >> >> reading metrics. > >> >> > >> >> This is the key point: although the node is not in the cluster but it > >> >> is > >> >> still alive can be monitored and supports management to do > >> >> maintenance. > >> >> > >> >> From the code complexity perspective I'm trying to design the > feature > >> in > >> >> such a way that all maintenance code is as encapsulated as possible > >> >> and > >> >> avoids massive interventions into main workflows of components. 
> >> >> At the moment I'm aware about two use cases for this feature: > >> >> corrupted > >> >> PDS > >> >> cleanup and defragmentation. As far as I know it won't bring too much > >> >> complexity in both cases. > >> >> > >> >> I cannot say for other components but I believe it will be possible > to > >> >> integrate MM feature into their workflow as well with reasonable > >> >> amount > >> >> of > >> >> refactoring. > >> >> > >> >
Re: [DISCUSSION] Maintenance Mode feature
Ivan, Checkpointer in Maintenance Mode is started and allows normal operations as it may be needed for defragmentation and possibly other cases. Discovery is started with a special implementation of SPI that doesn't make attempts to seek and/or connect to the rest of the cluster. From that perspective node in MM is totally isolated. Communication is started as usual but I believe it doesn't matter as discovery no other nodes are observed in topology and connection attempt should not happen. But it may make sense to implement isolated version of communication SPI as well to have 100% guarantee that no communication with other nodes will happen. It is important to note that GridRestProcessor is started normally as we need it to connect to the node via control utility. On Mon, Sep 21, 2020 at 7:04 PM Ivan Pavlukhin wrote: > Sergey, > > > From the code complexity perspective I'm trying to design the feature > in such a way that all maintenance code is as encapsulated as possible and > avoids massive interventions into main workflows of components. > > Could please briefly tell what means do you use to achieve > encapsulation? Are Discovery, Communication, Checkpointer and other > components started in a maintenance mode in current design? > > 2020-09-21 15:19 GMT+03:00, Nikolay Izhikov : > > Hello, Sergey. > > > >> At the moment I'm aware about two use cases for this feature: corrupted > >> PDS cleanup and defragmentation. > > > > AFAIKU There is third use-case for this mode. > > > > Change encryption master key in case node was down during cluster master > key > > change. > > In this case, node can’t join to the cluster, because it’s master key > > differs from the cluster. > > To recover node Ignite should locally change master key before join. > > > > Please, take a look into source code [1] > > > > [1] > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710 > > > >> 21 сент. 2020 г., в 14:37, Sergey Chugunov > >> написал(а): > >> > >> Ivan, > >> > >> Sorry for some confusion, MM indeed is not a normal mode. What I was > >> trying > >> to say is that when in MM node still starts and allows the user to > >> perform > >> actions with it like sending commands via control utility/JMX APIs or > >> reading metrics. > >> > >> This is the key point: although the node is not in the cluster but it is > >> still alive can be monitored and supports management to do maintenance. > >> > >> From the code complexity perspective I'm trying to design the feature > in > >> such a way that all maintenance code is as encapsulated as possible and > >> avoids massive interventions into main workflows of components. > >> At the moment I'm aware about two use cases for this feature: corrupted > >> PDS > >> cleanup and defragmentation. As far as I know it won't bring too much > >> complexity in both cases. > >> > >> I cannot say for other components but I believe it will be possible to > >> integrate MM feature into their workflow as well with reasonable amount > >> of > >> refactoring. > >> > >> Does it make sense to you? > >> > >> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin > >> wrote: > >> > >>> Sergey, > >>> > >>> Thank you for your answer! > >>> > >>> Might be I am looking at the subject from a different angle. > >>> > >>>> I think of a node in MM as an almost normal one > >>> I cannot think of such a mode as a normal one, because it apparently > >>> does not perform usual cluster node functions. 
It is not a part of a > >>> cluster, caches data is not available, Discovery and Communication are > >>> not needed. > >>> > >>> I fear that with "node started in a special mode" approach we will get > >>> an additional flag in the code making the code more complex and > >>> fragile. Should not I worry about it? > >>> > >>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov >: > >>>> Vladislav, Ivan, > >>>> > >>>> Thank you for your questions and suggestions. Let me answer them. > >>>> > >>>> Vladislav, > >>>> > >>>> If I understood you correctly, you're talking about a node performing > >>> some > >>>> automatic actions to fix th
Re: [DISCUSSION] Maintenance Mode feature
Ivan, Sorry for some confusion, MM indeed is not a normal mode. What I was trying to say is that when in MM node still starts and allows the user to perform actions with it like sending commands via control utility/JMX APIs or reading metrics. This is the key point: although the node is not in the cluster but it is still alive can be monitored and supports management to do maintenance. >From the code complexity perspective I'm trying to design the feature in such a way that all maintenance code is as encapsulated as possible and avoids massive interventions into main workflows of components. At the moment I'm aware about two use cases for this feature: corrupted PDS cleanup and defragmentation. As far as I know it won't bring too much complexity in both cases. I cannot say for other components but I believe it will be possible to integrate MM feature into their workflow as well with reasonable amount of refactoring. Does it make sense to you? On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin wrote: > Sergey, > > Thank you for your answer! > > Might be I am looking at the subject from a different angle. > > > I think of a node in MM as an almost normal one > I cannot think of such a mode as a normal one, because it apparently > does not perform usual cluster node functions. It is not a part of a > cluster, caches data is not available, Discovery and Communication are > not needed. > > I fear that with "node started in a special mode" approach we will get > an additional flag in the code making the code more complex and > fragile. Should not I worry about it? > > 2020-09-02 10:45 GMT+03:00, Sergey Chugunov : > > Vladislav, Ivan, > > > > Thank you for your questions and suggestions. Let me answer them. > > > > Vladislav, > > > > If I understood you correctly, you're talking about a node performing > some > > automatic actions to fix the problem and then join the cluster as usual. > > > > However the original ticket [1] where we faced the need for Maintenance > > Mode is about exactly the opposite: avoid doing automatic actions and > give > > a user the ability to decide what to do. > > > > Also the idea of Maintenance Mode is that the node is able to accept > > commands, expose metrics and so on, thus we need all components to be > > initialized (some of them may be partially initialized due to their own > > maintenance). > > To achieve that we need to go through a full cycle of node initialization > > including discovery initialization. When discovery is initialized (in > > special isolated mode) I don't think it is easy to switch back to normal > > operations without a restart. > > > > Ivan, > > > > I think of a node in MM as an almost normal one (maybe with some > components > > skipped some steps of their initialization). Commands are accepted, > > appropriate metrics are exposed e.g. through JMX API and so on. > > > > So as I see it we'll have special commands for control.{sh|bat} CLI > > allowing user to see reasons why node switched to maintenance mode and/or > > trigger actions to fix the problem (I'm still thinking about proper > design > > of these actions though). > > > > Of course the user should also be able to fix the problem manually e.g. > by > > manually deleting corrupted PDS files when node is down. Ideally > > Maintenance Mode should be smart enough to figure that out and switch to > > normal operations without a restart but I'm not sure if it is possible > > without invasive changes of our components' lifecycle. 
> > So I believe this model (node truly started in Maintenance Mode and new > > commands in control.{sh|bat}) is a good fit for our current APIs and ways > > to interact with the node. > > > > Does it sound reasonable to you? > > > > Thank you! > > > > [1] https://issues.apache.org/jira/browse/IGNITE-13366 > > > > On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin > wrote: > > > >> Sergey, > >> > >> Actually, I missed the point that the discussed mode affects a single > >> node but not a whole cluster. Perhaps I mixed terms "mode" and > >> "state". > >> > >> My next thoughts about maintenance routines are about special > >> utilities. As far as I remember MySQL provides a bunch of scripts for > >> various maintenance purposes. What user interface for maintenance > >> tasks execution is assumed? And what do we mean by "starting" a node > >> in a maintenance mode? Can we do some routines without "starting" > >> (e.g. try to recover PDS or cleanup)? > &g
Re: [DISCUSSION] Maintenance Mode feature
Vladislav, Ivan, Thank you for your questions and suggestions. Let me answer them. Vladislav, If I understood you correctly, you're talking about a node performing some automatic actions to fix the problem and then joining the cluster as usual. However, the original ticket [1] where we faced the need for Maintenance Mode is about exactly the opposite: avoiding automatic actions and giving the user the ability to decide what to do. Also, the idea of Maintenance Mode is that the node is able to accept commands, expose metrics and so on; thus we need all components to be initialized (some of them may be only partially initialized due to their own maintenance). To achieve that we need to go through a full cycle of node initialization, including discovery initialization. When discovery is initialized (in a special isolated mode) I don't think it is easy to switch back to normal operations without a restart. Ivan, I think of a node in MM as an almost normal one (maybe with some components skipping some steps of their initialization). Commands are accepted, appropriate metrics are exposed e.g. through the JMX API, and so on. So as I see it we'll have special commands for the control.{sh|bat} CLI allowing the user to see the reasons why the node switched to maintenance mode and/or trigger actions to fix the problem (I'm still thinking about the proper design of these actions, though). Of course the user should also be able to fix the problem manually, e.g. by deleting corrupted PDS files when the node is down. Ideally Maintenance Mode should be smart enough to figure that out and switch to normal operations without a restart, but I'm not sure if it is possible without invasive changes to our components' lifecycle. So I believe this model (node truly started in Maintenance Mode and new commands in control.{sh|bat}) is a good fit for our current APIs and ways to interact with the node. Does it sound reasonable to you? Thank you! [1] https://issues.apache.org/jira/browse/IGNITE-13366 On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin wrote: > Sergey, > > Actually, I missed the point that the discussed mode affects a single > node but not a whole cluster. Perhaps I mixed terms "mode" and > "state". > > My next thoughts about maintenance routines are about special > utilities. As far as I remember MySQL provides a bunch of scripts for > various maintenance purposes. What user interface for maintenance > tasks execution is assumed? And what do we mean by "starting" a node > in a maintenance mode? Can we do some routines without "starting" > (e.g. try to recover PDS or cleanup)? > > 2020-08-31 23:41 GMT+03:00, Vladislav Pyatkov : > > Hi Sergey. > > > > As I understand any switching from/to MM possible only through manual > > restart a node. > > But in your example that look like a technical actions, that only > possible > > in the case. > > Do you plan to provide a possibility for client where he can make a > > decision without a manual intervention? > > > > For example: Start node and manually agree with an option and after > > automatically resolve conflict and back to topology as a stable node. > > > > On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov < > sergey.chugu...@gmail.com> > > wrote: > > > >> Hello Ivan, > >> > >> Thank you for raising the good question, I didn't think of Maintenance > >> Mode > >> from that perspective. > >> > >> In short, Maintenance Mode isn't related to Cluster States concept.
> >> According to javadoc documentation of ClusterState enum [1] it is solely > >> about cache operations and to some extent doesn't affect other > components > >> of Ignite node. > >> From APIs perspective putting the methods to manage Cluster State to > >> IgniteCluster interface doesn't look ideal to me but it is as it is. > >> > >> On the other hand Maintenance Mode as I see it will be managed through > >> different APIs than a ClusterState and this difference definitely will > be > >> reflected in the documentation of the feature. > >> > >> Ignite node is a complex piece of many components interacting with each > >> other, they may have different lifecycles and states; states of > different > >> components cannot be reduced to the lowest common denominator. > >> > >> However if you have an idea of how to call the feature better to let the > >> user easier distinguish it from other similar features please share it > >> with > >> us. Personally I'm very welcome to any suggestions that make design more > >> intuitive and easy-to-use. > >> > >> Thanks! > >> > >> [1] > >> >
Re: [DISCUSSION] Maintenance Mode feature
Hello Ivan, Thank you for raising this good question; I hadn't thought of Maintenance Mode from that perspective. In short, Maintenance Mode isn't related to the Cluster States concept. According to the javadoc documentation of the ClusterState enum [1], it is solely about cache operations and mostly doesn't affect other components of an Ignite node. From an API perspective, putting the methods to manage Cluster State into the IgniteCluster interface doesn't look ideal to me, but it is what it is. On the other hand, Maintenance Mode as I see it will be managed through different APIs than ClusterState, and this difference will definitely be reflected in the documentation of the feature. An Ignite node is a complex composition of many components interacting with each other; they may have different lifecycles and states, and the states of different components cannot be reduced to the lowest common denominator. However, if you have an idea of how to name the feature better so that the user can more easily distinguish it from other similar features, please share it with us. Personally, I welcome any suggestions that make the design more intuitive and easy to use. Thanks! [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin wrote: > Hi Sergey, > > Thank you for bringing attention to that important subject! > > My note here is about one more cluster mode. As far as I know > currently we already have 3 modes (inactive, read-only, read-write) > and the subject is about one more. From the first glance it could be > hard for a user to understand and use all modes properly. Do we really > need all spectrum? Could we simplify things somehow? > > 2020-08-27 15:59 GMT+03:00, Sergey Chugunov : > > Hello Nikolay, > > > > Created one, available by link [1] > > > > Initially there was an intention to develop it under IEP-47 [2] and there > > is even a separate section for Maintenance Mode there. > > But it looks like this feature is useful in more cases and deserves its > own > > IEP. > > > > [1] > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode > > [2] > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation > > > > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov > > wrote: > > > >> Hello, Sergey! > >> > >> Thanks for the proposal. > >> Let's have IEP for this feature. > >> > >> > 27 Aug 2020, at 10:25, Sergey Chugunov > >> wrote: > >> > > >> > Hello Igniters, > >> > > >> > I want to start a discussion about new supporting feature that could > be > >> > very useful in many scenarios where persistent storage is involved: > >> > Maintenance Mode. > >> > > >> > *Summary* > >> > Maintenance Mode (MM for short) is a special state of Ignite node when > >> node > >> > doesn't serve user requests nor joins the cluster but waits for user > >> > commands or performs automatic actions for maintenance purposes. > >> > > >> > *Motivation* > >> > There are situations when node cannot participate in regular > operations > >> but > >> > at the same time should not be shut down. > >> > > >> > One example is a ticket [1] where I developed the first draft of > >> > Maintenance Mode. > >> > Here we get into a situation when node has potentially corrupted PDS > >> > thus > >> > cannot proceed with restore routine and join the cluster as usual. > >> > At the same time node should not fail nor be stopped for manual > >> > cleanup.
> >> > Manual cleanup is not always an option (e.g. restricted access to file > >> > system); in managed environments failed node will be restarted > >> > automatically so user won't have time for performing necessary > >> operations. > >> > Thus node needs to function in a special mode allowing user to connect > >> > to > >> > it and perform necessary actions. > >> > > >> > Another example is described in IEP-47 [2] where defragmentation is > >> > being > >> > developed. Node defragmenting its PDS should not join the cluster > until > >> the > >> > process is finished so it needs to enter Maintenance Mode as well. > >> > > >> > *Suggested design* > >> > I suggest MM to work as follows: > >> > 1. Node enters MM if special markers are found on disk. These
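For reference, the cluster states mentioned above are managed through the IgniteCluster API. Below is a minimal sketch, assuming the ClusterState enum and the IgniteCluster.state(...) methods that ship alongside the code linked in [1] (Ignite 2.9-era API); it illustrates cluster states only and is unrelated to the per-node Maintenance Mode API:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;

public class ClusterStateExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Cluster states govern cache operations only: INACTIVE,
            // ACTIVE and ACTIVE_READ_ONLY.
            ignite.cluster().state(ClusterState.ACTIVE_READ_ONLY);

            System.out.println("Cluster state: " + ignite.cluster().state());
        }
    }
}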
Re: [DISCUSSION] Maintenance Mode feature
Hello Nikolay, Created one, available by link [1] Initially there was an intention to develop it under IEP-47 [2] and there is even a separate section for Maintenance Mode there. But it looks like this feature is useful in more cases and deserves its own IEP. [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov wrote: > Hello, Sergey! > > Thanks for the proposal. > Let’s have IEP for this feature. > > > 27 авг. 2020 г., в 10:25, Sergey Chugunov > написал(а): > > > > Hello Igniters, > > > > I want to start a discussion about new supporting feature that could be > > very useful in many scenarios where persistent storage is involved: > > Maintenance Mode. > > > > *Summary* > > Maintenance Mode (MM for short) is a special state of Ignite node when > node > > doesn't serve user requests nor joins the cluster but waits for user > > commands or performs automatic actions for maintenance purposes. > > > > *Motivation* > > There are situations when node cannot participate in regular operations > but > > at the same time should not be shut down. > > > > One example is a ticket [1] where I developed the first draft of > > Maintenance Mode. > > Here we get into a situation when node has potentially corrupted PDS thus > > cannot proceed with restore routine and join the cluster as usual. > > At the same time node should not fail nor be stopped for manual cleanup. > > Manual cleanup is not always an option (e.g. restricted access to file > > system); in managed environments failed node will be restarted > > automatically so user won't have time for performing necessary > operations. > > Thus node needs to function in a special mode allowing user to connect to > > it and perform necessary actions. > > > > Another example is described in IEP-47 [2] where defragmentation is being > > developed. Node defragmenting its PDS should not join the cluster until > the > > process is finished so it needs to enter Maintenance Mode as well. > > > > *Suggested design* > > I suggest MM to work as follows: > > 1. Node enters MM if special markers are found on disk. These markers > > called Maintenance Records could be created automatically (e.g. when > > storage component detects corrupted storage) or by user request (when > user > > requests defragmentation of some caches). So entering MM requires node > > restart. > > 2. Started in MM node doesn't join the cluster but finishes startup > routine > > so it is able to receive commands and provide metrics to the user. > > 3. When all necessary maintenance operations are finished, Maintenance > > Records for these operations are deleted from disk and node restarted > again > > to enter normal service. > > > > *Example* > > To put it into a context let's consider an example of how I see the MM > > workflow in case of PDS corruption. > > > > 1. Node has failed in the middle of checkpoint when WAL is disabled for > > a particular cache -> data files of the cache are potentially > corrupted. > > 2. On next startup node detects this situation, creates Maintenance > > Record on disk and shuts down. > > 3. On next startup node sees Maintenance Record, enters Maintenance > Mode > > and waits for user to do specific actions: clean potentially corrupted > PDS. > > 4. 
When user has done necessary actions he/she removes Maintenance > > Record using Maintenance Mode API exposed via control.{sh|bat} script > or > > JMX. > > 5. On next startup node goes to normal operations as maintenance reason > > is fixed. > > > > > > I prepared a PR [3] for ticket [1] with draft implementation. It is not > > ready to be merged to master branch but is already fully functional and > can > > be reviewed. > > > > Hope you'll share your feedback on the feature and/or any thoughts on > > implementation. > > > > Thank you! > > > > [1] https://issues.apache.org/jira/browse/IGNITE-13366 > > [2] > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation > > [3] https://github.com/apache/ignite/pull/8189 > >
[DISCUSSION] Maintenance Mode feature
Hello Igniters, I want to start a discussion about a new supporting feature that could be very useful in many scenarios where persistent storage is involved: Maintenance Mode. *Summary* Maintenance Mode (MM for short) is a special state of an Ignite node in which the node doesn't serve user requests nor join the cluster, but waits for user commands or performs automatic actions for maintenance purposes. *Motivation* There are situations when a node cannot participate in regular operations but at the same time should not be shut down. One example is a ticket [1] where I developed the first draft of Maintenance Mode. Here we get into a situation where a node has potentially corrupted PDS and thus cannot proceed with the restore routine and join the cluster as usual. At the same time the node should not fail nor be stopped for manual cleanup. Manual cleanup is not always an option (e.g. restricted access to the file system); in managed environments a failed node will be restarted automatically, so the user won't have time to perform the necessary operations. Thus the node needs to function in a special mode allowing the user to connect to it and perform the necessary actions. Another example is described in IEP-47 [2] where defragmentation is being developed. A node defragmenting its PDS should not join the cluster until the process is finished, so it needs to enter Maintenance Mode as well. *Suggested design* I suggest MM to work as follows: 1. A node enters MM if special markers are found on disk. These markers, called Maintenance Records, could be created automatically (e.g. when the storage component detects corrupted storage) or by user request (when the user requests defragmentation of some caches). So entering MM requires a node restart. 2. A node started in MM doesn't join the cluster but finishes its startup routine, so it is able to receive commands and provide metrics to the user. 3. When all necessary maintenance operations are finished, the Maintenance Records for these operations are deleted from disk and the node is restarted again to enter normal service. *Example* To put it into context, let's consider an example of how I see the MM workflow in case of PDS corruption. 1. A node has failed in the middle of a checkpoint while WAL is disabled for a particular cache -> data files of the cache are potentially corrupted. 2. On the next startup the node detects this situation, creates a Maintenance Record on disk and shuts down. 3. On the next startup the node sees the Maintenance Record, enters Maintenance Mode and waits for the user to do specific actions: clean the potentially corrupted PDS. 4. When the user has done the necessary actions, he/she removes the Maintenance Record using the Maintenance Mode API exposed via the control.{sh|bat} script or JMX. 5. On the next startup the node returns to normal operations, as the maintenance reason is fixed. I prepared a PR [3] for ticket [1] with a draft implementation. It is not ready to be merged to the master branch but is already fully functional and can be reviewed. Hope you'll share your feedback on the feature and/or any thoughts on the implementation. Thank you! [1] https://issues.apache.org/jira/browse/IGNITE-13366 [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation [3] https://github.com/apache/ignite/pull/8189
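To make the suggested design concrete, here is a rough sketch of the startup decision from steps 1-3. All names here (the marker directory, the MaintenanceStartupCheck class) are hypothetical illustrations, not the API from the PR:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MaintenanceStartupCheck {
    /** Directory with maintenance markers; the real location is an implementation detail. */
    private final Path markersDir;

    MaintenanceStartupCheck(Path workDir) {
        markersDir = workDir.resolve("maintenance");
    }

    /** Names of pending Maintenance Records, empty if there are none. */
    List<String> pendingRecords() throws IOException {
        if (!Files.isDirectory(markersDir))
            return List.of();

        try (Stream<Path> files = Files.list(markersDir)) {
            return files.map(p -> p.getFileName().toString()).collect(Collectors.toList());
        }
    }

    /** Step 1: presence of markers on disk switches the node to MM on restart. */
    boolean shouldEnterMaintenanceMode() throws IOException {
        return !pendingRecords().isEmpty();
    }

    /** Step 3: after the user resolves the reason, the record is deleted and the node is restarted. */
    void deleteRecord(String recordName) throws IOException {
        Files.deleteIfExists(markersDir.resolve(recordName));
    }
}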
[jira] [Created] (IGNITE-13367) meta --remove command usage improvements
Sergey Chugunov created IGNITE-13367: Summary: meta --remove command usage improvements Key: IGNITE-13367 URL: https://issues.apache.org/jira/browse/IGNITE-13367 Project: Ignite Issue Type: Improvement Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.10 The command for removing metadata has the following issues: # In the 'Type not found' scenario it prints long stack traces to the console instead of short information about the requested type. # When used, it registers some internal classes which are not supposed to go through the binary metadata registration protocol. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13366) Prohibit unconditional automatic deletion of data files if WAL was disabled prior to node's shutdown
Sergey Chugunov created IGNITE-13366: Summary: Prohibit unconditional automatic deletion of data files if WAL was disabled prior to node's shutdown Key: IGNITE-13366 URL: https://issues.apache.org/jira/browse/IGNITE-13366 Project: Ignite Issue Type: Task Components: persistence Affects Versions: 2.8.1 Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.10 If a node with persistence is stopped while WAL is disabled for a cache (no matter whether because of rebalancing in progress or by explicit user request), on the next node start all data files of that cache are removed automatically and unconditionally. This behavior may be unexpected for users, as they may not understand all the consequences of disabling WAL locally (for rebalancing) or globally (via an IgniteCluster API call). It is also not smart enough, as there is no point in deleting consistent data files. We should change this behavior as follows: no automatic deletions whatsoever. If data files are consistent (equivalently: no checkpoint was running when the node was stopped), start up normally. If data files are corrupted, don't let the node start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
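A sketch of the proposed rule; CacheRecoveryInfo and its methods are hypothetical placeholders for whatever recovery metadata the node actually has at startup:

/** Hypothetical view of a cache's recovery state at node start. */
interface CacheRecoveryInfo {
    String name();
    boolean walWasDisabled();       // was WAL disabled before shutdown?
    boolean checkpointWasRunning(); // was the node stopped mid-checkpoint?
}

final class DataFilesStartupCheck {
    /** Proposed behavior: never delete data files automatically. */
    static void validate(CacheRecoveryInfo cache) {
        if (!cache.walWasDisabled())
            return; // Regular WAL-based recovery applies.

        if (!cache.checkpointWasRunning())
            return; // Data files are consistent: start up normally.

        // Potentially corrupted files: refuse to start instead of deleting them.
        throw new IllegalStateException("Cache '" + cache.name()
            + "' has potentially corrupted data files; manual maintenance is required.");
    }
}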
[jira] [Created] (IGNITE-13260) Improve javadoc documentation for FilePageStore abstraction.
Sergey Chugunov created IGNITE-13260: Summary: Improve javadoc documentation for FilePageStore abstraction. Key: IGNITE-13260 URL: https://issues.apache.org/jira/browse/IGNITE-13260 Project: Ignite Issue Type: Task Reporter: Sergey Chugunov Fix For: 2.10 The FilePageStore class javadoc comment doesn't provide any useful information about the role of this important class in the whole picture of Ignite Native Persistence. We need to add information about the responsibilities of the class and its relationships with other classes in the Ignite persistence module. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13239) Document APIs to view and change Cluster ID and Tag
Sergey Chugunov created IGNITE-13239: Summary: Document APIs to view and change Cluster ID and Tag Key: IGNITE-13239 URL: https://issues.apache.org/jira/browse/IGNITE-13239 Project: Ignite Issue Type: Task Reporter: Sergey Chugunov In IGNITE-13185, new APIs and changes were introduced to view the Cluster ID and Tag and to change the Tag. These APIs and their use cases need to be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13212) Peer class loading does not work for Scan Query
Sergey Chugunov created IGNITE-13212: Summary: Peer class loading does not work for Scan Query Key: IGNITE-13212 URL: https://issues.apache.org/jira/browse/IGNITE-13212 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.9 When a scan query with a transformer is executed via the {{IgniteCache::query}} API and the class passed as a transformer is not available on remote nodes, the p2p mechanism is not triggered and an exception is thrown on the server nodes executing the query. -- This message was sent by Atlassian Jira (v8.3.4#803005)
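A minimal sketch of the affected API path, assuming a cache named "values" on a started node; with peer class loading enabled the transformer class should be deployed to servers automatically, which is exactly what this bug breaks:

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ScanQuery;

public class ScanQueryTransformerExample {
    public static void main(String[] args) {
        try (Ignite node = Ignition.start()) {
            IgniteCache<Integer, String> cache = node.getOrCreateCache("values");
            cache.put(1, "hello");

            // The transformer runs on the nodes that own the data; in a real
            // cluster its class must reach the servers via p2p deployment.
            List<Integer> lengths = cache.query(
                new ScanQuery<Integer, String>(),
                entry -> entry.getValue().length()).getAll();

            System.out.println(lengths); // [5]
        }
    }
}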
[jira] [Created] (IGNITE-13190) Core defragmentation functions
Sergey Chugunov created IGNITE-13190: Summary: Core defragmentation functions Key: IGNITE-13190 URL: https://issues.apache.org/jira/browse/IGNITE-13190 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov The following set of functions covering the defragmentation happy case is needed: * Initialization of the defragmentation manager when a node is started in maintenance mode. * Information about partition files is gathered by the defragmentation manager. * For each partition file a corresponding file for the defragmented partition is created and initialized. * Keys are transferred from old partitions to new partitions. * The checkpointer is aware of the new partition files and flushes defragmented memory to them. Neither fault-tolerance code nor index defragmentation mappings are needed in this task. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13189) Maintenance mode switch and defragmentation process initialization
Sergey Chugunov created IGNITE-13189: Summary: Maintenance mode switch and defragmentation process initialization Key: IGNITE-13189 URL: https://issues.apache.org/jira/browse/IGNITE-13189 Project: Ignite Issue Type: Sub-task Reporter: Sergey Chugunov As described in IEP-47, defragmentation is performed when a node enters a special mode called maintenance mode. The discussion on the dev list clarified the algorithm for entering maintenance mode: # A special key is written to the local metastorage. # The node is restarted. # The node observes the key on startup and enters maintenance mode. The node should be fully functional in that mode but should not join the rest of the cluster or participate in any regular activity like handling cache operations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13185) API to change Cluster Tag and notify about change of Cluster Tag
Sergey Chugunov created IGNITE-13185: Summary: API to change Cluster Tag and notify about change of Cluster Tag Key: IGNITE-13185 URL: https://issues.apache.org/jira/browse/IGNITE-13185 Project: Ignite Issue Type: Improvement Reporter: Sergey Chugunov Assignee: Sergey Chugunov IGNITE-12111 introduced a new feature to identify and distinguish different clusters. To make the feature more usable we need a new command in the CLI interface to change the Cluster Tag and a new event to subscribe to for changes of the Cluster Tag. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13182) Document Cluster ID and Tag feature
Sergey Chugunov created IGNITE-13182: Summary: Document Cluster ID and Tag feature Key: IGNITE-13182 URL: https://issues.apache.org/jira/browse/IGNITE-13182 Project: Ignite Issue Type: Task Reporter: Sergey Chugunov IGNITE-12111 introduced a new feature to identify and give a name to the cluster: Cluster ID and Tag. The feature in general and the APIs to manage it in particular need to be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSSION] New Ignite settings for IGNITE-12438 and IGNITE-13013
Val, I like your suggestion about naming; it describes the purpose of the configuration best. On Tue, Jun 16, 2020 at 5:18 PM Ivan Bessonov wrote: > Hi, > > I created new issue that describes CQ problem in more details: [1] > I'm fine with experimental status and new property naming. > > [1] https://issues.apache.org/jira/browse/IGNITE-13156 > > Tue, 16 Jun 2020 at 02:20, Valentin Kulichenko < > valentin.kuliche...@gmail.com>: > > > Folks, > > > > Thanks for providing the detailed clarifications. Let's add the > parameter, > > mark the new feature as experimental, and target for making it the > default > > mode in Ignite 3.0. > > > > I still don't think we can come up with a naming that is really > intuitive, > > but let's try to simplify it as much as possible. How about this: > > > > TcpCommunicationSpi#forceClientToServerConnections -- false by default, > > true if the new mode needs to be enabled. > > > > Let me know your thoughts. > > > > -Val > > > > On Wed, Jun 10, 2020 at 4:10 PM Denis Magda wrote: > > > > > Sergey, > > > > > > Thanks for the detailed explanation and for covering all corner cases. > > > > > > Considering the improvement's criticality, I would continue moving in > the > > > initial direction and add that particular configuration property. > > > Potentially, we can put more effort throughout an Ignite 3.0 timeframe > > and > > > remove the property altogether. @Valentin Kulichenko > > > , could you please suggest any alternate > > naming? > > > > > > Btw, what are the specifics of the issue with continuous queries? It > will > > > be ideal if we could release this new communication option in the GA > > state > > > in 2.9. > > > > > > - > > > Denis > > > > > > > > > On Wed, Jun 10, 2020 at 1:22 AM Sergey Chugunov < > > sergey.chugu...@gmail.com > > > > > > > wrote: > > > > > > > Denis, Val, > > > > > > > > Idea of prohibiting servers to open connections to clients and force > > > > clients to always open "inverse connections" to servers looks > > promising. > > > To > > > > be clear, by "inverse connections" I mean here that server needed to > > > > communicate with client requests client to open a connection back > > instead > > > > of opening connection by itself using addresses published by the > > client. > > > > > > > > If we apply the idea it will indeed allow us to simplify our > > > configuration > > > > (no need for new configuration property), another advantage is > clients > > > > won't need to publish their addresses anymore (with one side note > I'll > > > > cover at the end), it will also simplify our code. > > > > > > > > However applying it with current implementation of inverse connection > > > > request (when request goes across all ring) may bring significant > delay > > > of > > > > opening first connection depending on cluster size and relative > > positions > > > > between server that needs to communicate with client (target server) > > and > > > > client's router node. > > > > > > > > It is possible to overcome this by sending inverse connection request > > not > > > > via discovery but directly to router server node via communication > and > > > > convert to discovery message only on the router. We'll still have two > > > hops > > > > of communication request instead of one plus discovery worker on > client > > > may > > > > be busy working on other stuff slowing down handling of connection > > > request. > > > > But it should be fine.
> > > > > > > > However with this solution it is hard to implement failover of router > > > node: > > > > let me describe it in more details. > > > > In case of router node failure target server won't be able to > determine > > > if > > > > client received inverse comm request successfully and (even worse) > > won't > > > be > > > > able to figure out new router for the client without waiting for > > > discovery > > > > event of the client reconnect. > > > > And this brings us to the following choise: we either wait for > > discovery > > > > event about
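A sketch of how the agreed-upon property would be enabled on a client node, assuming it shipped under the name Val suggested (TcpCommunicationSpi#forceClientToServerConnections, marked experimental):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class ForceClientToServerConnectionsExample {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Experimental mode discussed in this thread: servers never open
        // connections to clients; clients always connect to servers.
        commSpi.setForceClientToServerConnections(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true)
            .setCommunicationSpi(commSpi);

        Ignition.start(cfg);
    }
}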
[jira] [Created] (IGNITE-13151) Checkpointer code refactoring
Sergey Chugunov created IGNITE-13151: Summary: Checkpointer code refactoring Key: IGNITE-13151 URL: https://issues.apache.org/jira/browse/IGNITE-13151 Project: Ignite Issue Type: Sub-task Components: persistence Reporter: Sergey Chugunov The checkpointer is at the center of the Ignite persistence subsystem; the more people from the community understand it, the more stable and efficient it becomes. However, for now the checkpointer code sits inside the GridCacheDatabaseSharedManager class and is entangled with this higher-level and more general component. To take a step toward a more modular checkpointer we need to do two things: # Move the checkpointer code outside the database manager to a separate class. # Create a well-defined checkpointer API that will allow us to create new checkpointer implementations in the future. An example of this is the new checkpointer implementation needed for the defragmentation feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13143) Persistent store defragmentation
Sergey Chugunov created IGNITE-13143: Summary: Persistent store defragmentation Key: IGNITE-13143 URL: https://issues.apache.org/jira/browse/IGNITE-13143 Project: Ignite Issue Type: New Feature Reporter: Sergey Chugunov Persistent store enables users to store the data of their caches in a durable fashion on disk while still benefiting from the in-memory nature of Apache Ignite. Data of caches is stored in files created for every primary or backup partition assigned to the node, and in an additional file for all user indexes. Files in the filesystem are allocated lazily (only if some data is actually stored to a particular partition) and grow automatically when more data is added to the cache. But the problem is that the files cannot shrink even if all data is removed. This umbrella ticket covers all other tasks needed to implement a simple yet effective approach to defragmentation. A detailed discussion can be found in [IEP-47|https://cwiki.apache.org/confluence/display/IGNITE/IEP-47%3A+Native+persistence+defragmentation] and in the corresponding [dev-list discussion|http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-47-Native-persistence-defragmentation-td47717.html], but the core ideas are as follows: # Defragmentation is performed in a special _maintenance_ mode in which the node starts and provides access to some APIs like metrics or JMX management but doesn't join the cluster. # It is performed by copying all data from all partitions on the node to new files with automatic compaction. After a successful copy the old partition files are deleted. # Metrics on the progress of the operation are provided to the user. # The operation is fault-tolerant and in case of node failure proceeds after node restart. -- This message was sent by Atlassian Jira (v8.3.4#803005)
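A high-level illustration of core idea #2 (copy live data into fresh, densely packed files, then swap them in); every type here is a hypothetical placeholder, and real fault tolerance (idea #4) is omitted:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Source partition file: exposes only live entries, skipping free/garbage pages. */
interface PartitionStore {
    Path file();
    Iterable<byte[]> liveEntries();
}

/** Writer that appends entries densely into a new partition file. */
interface PartitionWriter extends AutoCloseable {
    void append(byte[] entry);
    @Override void close() throws Exception;
}

interface PartitionWriterFactory {
    PartitionWriter create(Path file) throws Exception;
}

final class DefragmentationSketch {
    static void defragment(List<PartitionStore> parts, PartitionWriterFactory factory) throws Exception {
        for (PartitionStore part : parts) {
            Path newFile = part.file().resolveSibling(part.file().getFileName() + ".defrag");

            // Copy all live data into a fresh, compact file.
            try (PartitionWriter writer = factory.create(newFile)) {
                for (byte[] entry : part.liveEntries())
                    writer.append(entry);
            }

            // After a successful copy the old file is replaced by the new one.
            Files.move(newFile, part.file(), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}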
[jira] [Created] (IGNITE-13141) Modify .NET counterpart of IgniteCluster to include functionality of Cluster ID and tag
Sergey Chugunov created IGNITE-13141: Summary: Modify .NET counterpart of IgniteCluster to include functionality of Cluster ID and tag Key: IGNITE-13141 URL: https://issues.apache.org/jira/browse/IGNITE-13141 Project: Ignite Issue Type: Task Reporter: Sergey Chugunov After the implementation of IGNITE-12111, .NET tests showed broken API parity for the new methods in the IgniteCluster interface. We need to implement the same functionality on the .NET side (see the description in the linked ticket). -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSSION] New Ignite settings for IGNITE-12438 and IGNITE-13013
Denis, Val, The idea of prohibiting servers from opening connections to clients and forcing clients to always open "inverse connections" to servers looks promising. To be clear, by "inverse connections" I mean here that a server that needs to communicate with a client requests the client to open a connection back, instead of opening the connection itself using the addresses published by the client. If we apply the idea it will indeed allow us to simplify our configuration (no need for a new configuration property); another advantage is that clients won't need to publish their addresses anymore (with one side note I'll cover at the end), and it will also simplify our code. However, applying it with the current implementation of the inverse connection request (when the request goes across the whole ring) may introduce a significant delay in opening the first connection, depending on the cluster size and the relative positions of the server that needs to communicate with the client (the target server) and the client's router node. It is possible to overcome this by sending the inverse connection request not via discovery but directly to the router server node via communication, converting it to a discovery message only on the router. We'll still have two hops of the communication request instead of one, plus the discovery worker on the client may be busy working on other stuff, slowing down the handling of the connection request. But it should be fine. However, with this solution it is hard to implement failover of the router node; let me describe it in more detail. In case of a router node failure the target server won't be able to determine whether the client received the inverse comm request successfully and (even worse) won't be able to figure out a new router for the client without waiting for the discovery event of the client reconnect. And this brings us to the following choice: we either wait for the discovery event about the client reconnect (this is deadlock-prone, as the current protocol of CQ registration opens a comm connection to the client right from the discovery thread in some cases; if we wait for a new discovery event from the discovery thread, it is a deadlock), or we fail opening the connection by timeout, thus adding new scenarios where opening a connection may fail. Thus implementing the communication model "clients connect to servers, servers never connect to clients" efficiently requires more work on different parts of our functionality and rigorous testing of the readiness of our code for more communication connection failures. So to me the least risky decision is not to delete the new configuration property but to leave it with experimental status. If we find out that the direct request (server -> router server -> target client) implementation works well and doesn't bring much complexity in failover scenarios, we'll remove that configuration and prohibit servers from opening connections to clients by default. Side note: there are rare but still possible scenarios where a client node needs to open a communication connection to another client node. If we allow clients not to publish their addresses, these scenarios will stop working without additional logic like sending data through a router node. As far as I know client-client connectivity is involved in p2p class deployment scenarios; does anyone know about other cases? -- Thanks, Sergey Chugunov On Wed, Jun 3, 2020 at 5:37 PM Denis Magda wrote: > Ivan, > > It feels like Val is driving us in the right direction. Is there any reason > for keeping the current logic when servers can open connections to clients?
> > - > Denis > > > On Thu, May 21, 2020 at 4:48 PM Valentin Kulichenko < > valentin.kuliche...@gmail.com> wrote: > > > Ivan, > > > > Have you considered eliminating server to client connections altogether? > > Or, at the very least making the "client to server only" mode the default > > one? > > > > All the suggested names are confusing and not intuitive, and I doubt we > > will be able to find a good one. A server initiating a TCP connection > with > > a client is confusing in the first place and creates a usability issue. > We > > now want to solve it by introducing an additional configuration > > parameter, and therefore additional complexity. I don't think this is the > > right approach. > > > > What are the drawbacks of permanently switching to client-to-server > > connections? Is there any value provided by the server-to-client option? > > > > As for pair connections, I'm not sure I understand why there is a > > limitation. As far as I know, the idea behind this feature is that we > > maintain two connections between two nodes instead of one, so that every > > connection is used for communication in a single direction only. Why does > > it matter which node initiates the connection? Why can't one of the nodes > > (e.g., a client) initiate both connections, and then use them > accordingly? > > Correct me
Re: Question: network issues of single node.
Of course, I meant that ticket [1] increased cluster stability in situations with a blinking network. [1] https://issues.apache.org/jira/browse/IGNITE-7163 On Mon, Jun 8, 2020 at 1:51 PM Sergey Chugunov wrote: > Vladimir, > > Adding to what Alexey has said I remember that cases of short-term network > issues (blinking network) were also a driver for this improvement. They are > indeed hard to reproduce but have been seen in real world set-ups and have > proven to increase cluster stability. > > On Sat, Jun 6, 2020 at 5:09 PM Denis Magda wrote: > >> Finally, I got your question. >> >> Back in 2017-2018, there was a Discovery SPI's stabilization activity. The >> networking component could fail in various hard-to-reproduce scenarios >> affecting cluster availability and consistency. That ticket reminds me of >> those notorious issues that would fire once a week or month under specific >> configuration settings. So, I would not touch the code that fixes the >> issue >> unless @Alexey Goncharuk or @Sergey Chugunov >> confirms that it's safe to do. Also, there >> should >> be a test for this scenario. >> >> - >> Denis >> >> >> On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin >> wrote: >> >> > Denis, >> > >> > I have no nodes that I'm unable to interconnect. This case is simulated >> > in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill() >> > Introduced in [1]. >> > >> > I'm asking if it is real or supposed problem. Where it was met? Which >> > network configuration/issues could be? >> > >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7163 >> > >> > 05.06.2020 1:01, Denis Magda wrote: >> > > Vladimir, >> > > >> > > I'm suggesting to share the log files from the nodes that are unable >> to >> > > interconnect so that the community can check them for potential >> issues. >> > > Instead of sharing the logs from all the 5 nodes, try to start a >> > two-nodes >> > > cluster with the nodes that fail to discover each other and attach the >> > logs >> > > from those. >> > > >> > > - >> > > Denis >> > > >> > > >> > > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin >> > wrote: >> > > >> > >> Denis, hi. >> > >> >> > >> Sorry, I didn't catch your idea. Are you saying this can happen >> and >> > >> suggest experiment? I'm not describing a probable case. It is already >> > >> done in [1]. I'm asking is it real, where it was met. >> > >> >> > >> >> > >> 04.06.2020 23:33, Denis Magda wrote: >> > >>> Vladimir, >> > >>> >> > >>> Please do the following experiment. Start a 2-nodes cluster booting >> node >> > >> 3 >> > >>> and, for instance, node 5. Those won't be able to interconnect >> according >> > >> to >> > >>> your description. Attach the log files from both nodes for analysis. >> > This >> > >>> should be a networking issue. >> > >>> >> > >>> - >> > >>> Denis >> > >>> >> > >>> >> > >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin > > >> > >> wrote: >> > >>>>Hi, Igniters. >> > >>>> >> > >>>> >> > >>>>I wanted to ask how one node may not be able to connect to >> another >> > >>>> whereas rest of the cluster can. This got covered in [1]. In short: >> node >> > >>>> 3 can't connect to nodes 4 and 5 but can to 1. At the same time, >> node >> > 2 >> > >>>> can connect to 4. Questions: >> > >>>> >> > >>>> 1) Is it real case? Where this problem came from? >> > >>>> >> > >>>> 2) If node 3 can't connect to 4 and 5, does it mean node 2 can't >> > connect >> > >>>> to 4 (and 5) too? >> > >>>> >> > >>>> Sergey, Dmitry maybe you bring light (I see you in [1])?
I'm >> > >>>> participating in [2] and found this backward connection checking. >> > >>>> Answering would help us a lot. >> > >>>> >> > >>>> Thanks! >> > >>>> >> > >>>> [1] >> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163< >> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163> >> > >>>> >> > >>>> [2] >> > >>>> >> > >>>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up >> > >>>> < >> > >>>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up >> > >> >
Re: Question: network issues of single node.
Vladimir, Adding to what Alexey has said I remember that cases of short-term network issues (blinking network) were also a driver for this improvement. They are indeed hard to reproduce but have been seen in real world set-ups and have proven to increase cluster stability. On Sat, Jun 6, 2020 at 5:09 PM Denis Magda wrote: > Finally, I got your question. > > Back in 2017-2018, there was a Discovery SPI's stabilization activity. The > networking component could fail in various hard-to-reproduce scenarios > affecting cluster availability and consistency. That ticket reminds me of > those notorious issues that would fire once a week or month under specific > configuration settings. So, I would not touch the code that fixes the issue > unless @Alexey Goncharuk or @Sergey Chugunov > confirms that it's safe to do. Also, there should > be a test for this scenario. > > - > Denis > > > On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin > wrote: > > > Denis, > > > > I have no nodes that I'm unable to interconnect. This case is simulated > > in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill() > > Introduced in [1]. > > > > I’m asking if it is real or supposed problem. Where it was met? Which > > network configuration/issues could be? > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-7163 > > > > 05.06.2020 1:01, Denis Magda пишет: > > > Vladimir, > > > > > > I'm suggesting to share the log files from the nodes that are unable to > > > interconnect so that the community can check them for potential issues. > > > Instead of sharing the logs from all the 5 nodes, try to start a > > two-nodes > > > cluster with the nodes that fail to discover each other and attach the > > logs > > > from those. > > > > > > - > > > Denis > > > > > > > > > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin > > wrote: > > > > > >> Denis, hi. > > >> > > >> Sorry, I didn’t catch your idea. Are you saying this can happen > > and > > >> suggest experiment? I’m not descripting a probable case. It is already > > >> done in [1]. I’m asking is it real, where it was met. > > >> > > >> > > >> 04.06.2020 23:33, Denis Magda пишет: > > >>> Vladimir, > > >>> > > >>> Please do the following experiment. Start a 2-nodes cluster booting > > node > > >> 3 > > >>> and, for instance, node 5. Those won't be able to interconnect > > according > > >> to > > >>> your description. Attach the log files from both nodes for analysis. > > This > > >>> should be a networking issue. > > >>> > > >>> - > > >>> Denis > > >>> > > >>> > > >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin > > >> wrote: > > >>>>Hi, Igniters. > > >>>> > > >>>> > > >>>>I wanted to ask how one node may not be able to connect to > > another > > >>>> whereas rest of the cluster can. This got covered in [1]. In short: > > node > > >>>> 3 can't connect to nodes 4 and 5 but can to 1. At the same time, > node > > 2 > > >>>> can connect to 4. Questions: > > >>>> > > >>>> 1) Is it real case? Where this problem came from? > > >>>> > > >>>> 2) If node 3 can’t connect to 4 and 5, does it mean node 2 can’t > > connect > > >>>> to 4 (and 5) too? > > >>>> > > >>>> Sergey, Dmitry maybe you bring light (I see you in [1])? I'm > > >>>> participating in [2] and found this backward connection checking. > > >>>> Answering would help us a lot. > > >>>> > > >>>> Thanks! 
> > >>>> > > >>>> [1] > > >>>> https://issues.apache.org/jira/browse/IGNITE-7163< > > >>>> https://issues.apache.org/jira/browse/IGNITE-7163> > > >>>> > > >>>> [2] > > >>>> > > >>>> > > >> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up > > >>>> < > > >>>> > > >> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up > > >
Re: [DISCUSSION] IEP-47 Native persistence defragmentation
Hi Ivan, I have an idea about the suggested maintenance mode. First of all, I agree with your ideas about discovery restrictions: a node should not join the topology when performing defragmentation. At the same time I haven't heard about requests for this mode from users, so we don't know much about possible requirements. So I suggest implementing it in a pragmatic way: instead of inventing (unknown in reality) user scenarios, let's develop minimal but well-designed functionality that suits our case. If we constrain our implementation with a reasonable set of restrictions, that's OK. So my idea is the following: to transition a node to maintenance, the user has to send a special command to the node (e.g. with a new command in the control.sh utility or via the JMX interface). The node saves the maintenance request in the local metastorage and waits for a restart. The user has to manually restart that node in order to finish moving it to maintenance mode. When the node restarts and finds the maintenance request, it creates a special type of discovery SPI that will not try to join the topology at all, yet the node is able to start all necessary components and APIs like the REST processor or the JMX interface. When in maintenance, we'll be able to do defragmentation safely and remove the maintenance request from the metastorage only when it is completed (with all the fault-tolerance logic in mind). As we don't have a mechanism (like a watcher) to perform a "safe restart" (by safe I mean an Ignite restart without an OS-level process restart), we cannot finish maintenance mode without another manual restart, but I think it is a reasonable restriction, as maintenance mode shouldn't be an every-day routine and will be used quite rarely. What do you think about it? On Tue, May 26, 2020 at 5:58 PM Ivan Bessonov wrote: > Hello Igniters, > > I'd like to discuss this new IEP with you: [1]. The main idea is to have a > procedure that relocates > pages to the top of the file as compact as possible which allows us to > trim the file and increase its > fill-factor. It will be configured manually and executed after the restart, > but before node joins > topology (it means any load would be impossible during defragmentation). It > is described in detail > in the IEP itself, please read it. This topic was also previously discussed > here on dev-list in [2]. > > Here I would like to list a few moments that are not as clear and require > your opinion. > > - what are your overall thoughts on the IEP? Any concerns? > > - maintenance mode - how do we communicate with the node that's not in > topology? What are >the options? As far as I know, we have no current tools like this. > > - checkpointer refactoring - these changes will involve intensive writing > of pages to the storage. >If we're going to reuse the offheap page model to perform > defragmentation then the >checkpointing mechanism will have to be adapted in some form. >Are you fine with this? Or we need a separate discussion? > > [1] > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47%3A+Native+persistence+defragmentation > [2] > > http://apache-ignite-developers.2346864.n4.nabble.com/How-to-free-up-space-on-disc-after-removing-entries-from-IgniteCache-with-enabled-PDS-td39839.html > > > -- > Sincerely yours, > Ivan Bessonov >
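A sketch of the proposed switch, with a hypothetical LocalMetastorage interface standing in for the real node-local metastorage API:

import java.util.Optional;

/** Hypothetical stand-in for the node-local metastorage. */
interface LocalMetastorage {
    void write(String key, String value);
    Optional<String> read(String key);
    void remove(String key);
}

final class MaintenanceSwitchSketch {
    private static final String MNT_KEY = "maintenance-request";

    /** A control.sh/JMX command persists the request; the user then restarts the node. */
    static void requestMaintenance(LocalMetastorage ms, String reason) {
        ms.write(MNT_KEY, reason);
    }

    /** On startup: if the request is present, create the isolated discovery SPI. */
    static boolean shouldStartIsolated(LocalMetastorage ms) {
        return ms.read(MNT_KEY).isPresent();
    }

    /** Called only after defragmentation completes; another manual restart follows. */
    static void completeMaintenance(LocalMetastorage ms) {
        ms.remove(MNT_KEY);
    }
}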
[jira] [Created] (IGNITE-12878) Improve logging for async writing of Binary Metadata
Sergey Chugunov created IGNITE-12878: Summary: Improve logging for async writing of Binary Metadata Key: IGNITE-12878 URL: https://issues.apache.org/jira/browse/IGNITE-12878 Project: Ignite Issue Type: Task Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8.1 A new implementation of writing binary metadata outside of the discovery thread was introduced in IGNITE-12099, but sufficient debug logging was missing. To provide enough information for debugging, we need to add the necessary logging. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12876) Test to cover deadlock fix between checkpoint, entry update and ttl-cleanup threads
Sergey Chugunov created IGNITE-12876: Summary: Test to cover deadlock fix between checkpoint, entry update and ttl-cleanup threads Key: IGNITE-12876 URL: https://issues.apache.org/jira/browse/IGNITE-12876 Project: Ignite Issue Type: Test Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8.1 The IGNITE-12594 ticket fixed a deadlock between several threads that was reproducible with low probability in unrelated tests. To improve test coverage of the fix, a new test dedicated to the deadlock situation is needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: deadlock in system pool on meta update
Hello Sergey, Your analysis looks valid to me, we definitely need to investigate this deadlock and find out how to fix it. Could you create a ticket and write a test that reproduces the issue with sufficient probability? Thanks! On Mon, Mar 16, 2020 at 8:22 PM Sergey-A Kosarev wrote: > Classification: Public > > Hi, > I've recently tried to apply Ilya's idea ( > https://issues.apache.org/jira/browse/IGNITE-12663) of minimizing thread > pools and tried to set system pool to 3 in my own tests. > It caused deadlock on a client node and I think it can happen not only on > such small pool values. > > Details are following: > I'm not using persistence currently (if it matters). > On the client note I use ignite compute to call a job on every server > node (there are 3 server nodes in the tests). > > Then I've found in logs: > > [10:55:21] : [Step 1/1] [2020-03-13 10:55:21,773] { > grid-timeout-worker-#8} [WARN] [o.a.i.i.IgniteKernal] - Possible thread > pool starvation detected (no task completed in last 3ms, is system > thread pool size large enough?) > [10:55:21] : [Step 1/1] ^-- System thread pool [active=3, idle=0, > qSize=9] > > > I see in threaddumps that all 3 system pool workers do the same - > processing of job responses: > > "sys-#26" #605 daemon prio=5 os_prio=0 tid=0x64a0a800 nid=0x1f34 > waiting on condition [0x7b91d000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177) > at > org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140) > at > org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.metadata(CacheObjectBinaryProcessorImpl.java:749) > at > org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$1.metadata(CacheObjectBinaryProcessorImpl.java:250) > at > org.apache.ignite.internal.binary.BinaryContext.metadata(BinaryContext.java:1169) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.getOrCreateSchema(BinaryReaderExImpl.java:2005) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.(BinaryReaderExImpl.java:285) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.(BinaryReaderExImpl.java:184) > at > org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797) > at > org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160) > at > org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.readField(BinaryReaderExImpl.java:1982) > at > org.apache.ignite.internal.binary.BinaryFieldAccessor$DefaultFinalClassAccessor.read0(BinaryFieldAccessor.java:702) > at > org.apache.ignite.internal.binary.BinaryFieldAccessor.read(BinaryFieldAccessor.java:187) > at > org.apache.ignite.internal.binary.BinaryClassDescriptor.read(BinaryClassDescriptor.java:887) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1762) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714) > at > 
org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797) > at > org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160) > at > org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914) > at > org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714) > at > org.apache.ignite.internal.binary.GridBinaryMarshaller.deserialize(GridBinaryMarshaller.java:306) > at > org.apache.ignite.internal.binary.BinaryMarshaller.unmarshal0(BinaryMarshaller.java:100) > at > org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:80) > at > org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10493) > at >
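The stack traces above show all three system pool workers parked on the same metadata future. The pattern is plain thread pool starvation; here is a self-contained illustration (generic Java, not Ignite code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PoolStarvationDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService sysPool = Executors.newFixedThreadPool(3); // "system pool" of size 3
        CompletableFuture<Void> metadata = new CompletableFuture<>();

        List<Future<?>> tasks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            // Analog of a job response handler that blocks waiting for binary metadata.
            tasks.add(sysPool.submit(() -> metadata.join()));
        }

        // The task that would deliver the metadata is queued behind the three
        // blocked workers, so the pool can never make progress: a deadlock.
        sysPool.submit(() -> metadata.complete(null));

        for (Future<?> t : tasks)
            t.get(2, TimeUnit.SECONDS); // throws TimeoutException: starvation

        sysPool.shutdownNow();
    }
}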
Re: MetaStorage key length limitations and Cache Metrics configuration
Ivan, I also don't think this issue is a blocker for 2.8, as it affects only experimental functionality and only in special cases. Removing the key length limitation in MetaStorage seems the more strategic approach to me, but depending on how we decide to approach it (as a local fix or as part of a broader improvement of the MetaStorage internal implementation) we may target it to 2.8.1 or 2.9. In the latter case it makes sense to implement key length validation [1] and include it in 2.8.1 to prevent users from performing destructive actions. Otherwise, if we decide to implement [2] earlier and remove this pesky limitation in 2.8.1, then I'm fine with closing [1] with a "Won't fix" resolution. Does it make sense to you? [1] https://issues.apache.org/jira/browse/IGNITE-12721 [2] https://issues.apache.org/jira/browse/IGNITE-12726 On Fri, Feb 28, 2020 at 4:18 PM Maxim Muzafarov wrote: > Ivan, > > > This issue doesn't seem to be a blocker for 2.8 release from my point of > view. > I think we definitely will have such bugs in future and 2.8.1 is our > goal for them. > > Please, let me know if we should wait for the fix and include it exactly > in 2.8. > > On Fri, 28 Feb 2020 at 15:40, Nikolay Izhikov wrote: > > > > Igniters, > > > > I think we can replace cache name with the cache id. > > This should solve issue with the length limitation. > > > > What do you think? > > > > > 28 февр. 2020 г., в 15:32, Ivan Bessonov > написал(а): > > > > > > Hello Igniters, > > > > > > we have an issue in master branch and in the upcoming 2.8 release that > > > related to new metrics functionality implemented in [1]. You can't use new > > > "configureHistogramMetric" and "configureHitRateMetric" configuration > > > methods on caches with long names. My estimation shows that cache with 30 > > > characters in its name will shut down your whole cluster with failure > > > handler if > > > you try to change metrics configuration for it using one of those methods. > > > > > > Initially we wanted to merge [2] to show a valid error message instead of > > > failing > > > the cluster, but it wasn't in plans for 2.8 because we didn't know that it > > > clashes > > > with [1]. > > > > > > I created issue [3] with plans of removing MetaStorage key length > > > limitations, but > > > it requires some thoughtful MetaStorageTree reworkings. I mean that it > > > can't be > > > done in only a few days. > > > > > > What do you think? Does this issue affect 2.8 release? AFAIK new metrics are > > > experimental and they can have some known issues. Feel free to ask me for > > > more > > > details if it's needed. > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11987 > > > [2] https://issues.apache.org/jira/browse/IGNITE-12721 > > > [3] https://issues.apache.org/jira/browse/IGNITE-12726 > > > > > > -- > > > Sincerely yours, > > > Ivan Bessonov > > >
[jira] [Created] (IGNITE-12721) Validation of key length written to Distributed Metastorage
Sergey Chugunov created IGNITE-12721: Summary: Validation of key length written to Distributed Metastorage Key: IGNITE-12721 URL: https://issues.apache.org/jira/browse/IGNITE-12721 Project: Ignite Issue Type: Task Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.9 The DistributedMetastorage functionality introduced in IGNITE-10640 provides a convenient way to perform coordinated writes to the local MetaStorages on all server nodes, but lacks an important part: validation of key length. The current implementation of MetaStorage doesn't allow keys longer than a specific value (64 bytes minus some prefixes, see the source code for details) and throws an assertion error on an attempt to write a longer key. This error from MetaStorage is not propagated to DistributedMetastorage and (in theory) may even cause a node to halt. In order to avoid this situation, validation of key length should be added directly to the DistributedMetastorage implementation to enforce the "fail-fast" principle and protect Ignite nodes from potentially dangerous consequences. -- This message was sent by Atlassian Jira (v8.3.4#803005)
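A minimal sketch of the proposed fail-fast check; the exact byte limit used here is an assumption, since the real one is "64 bytes minus some prefixes" and lives in the MetaStorage source:

import java.nio.charset.StandardCharsets;

final class MetastorageKeyValidator {
    /** Illustrative limit only; the real value must match the MetaStorage implementation. */
    private static final int MAX_KEY_BYTES = 64;

    static void validateKey(String key) {
        int len = key.getBytes(StandardCharsets.UTF_8).length;

        if (len > MAX_KEY_BYTES)
            throw new IllegalArgumentException("Distributed metastorage key is too long [key="
                + key + ", bytes=" + len + ", max=" + MAX_KEY_BYTES + ']');
    }
}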
Re: [VOTE] Allow or prohibit a joint use of @deprecated and @IgniteExperimental
-1 Prohibit. To me as a developer, the situation when an old but stable API is deprecated with only an experimental (thus unstable/unfinished) alternative is very far from comfortable. And to outside folks it may look like a sign of immature processes inside the Ignite community (which is definitely not the case) and reduce users' overall impression. On Tue, Feb 11, 2020 at 2:20 PM Andrey Gura wrote: > -1 Prohibit > > On Mon, Feb 10, 2020 at 11:02 AM Alexey Goncharuk > wrote: > > > > Dear Apache Ignite community, > > > > We would like to conduct a formal vote on the subject of whether to allow > > or prohibit a joint existence of @deprecated annotation for an old API > > and @IgniteExperimental [1] for a new (replacement) API. The result of this > > vote will be formalized as an Apache Ignite development rule to be used in > > future. > > > > The discussion thread where you can address all non-vote messages is [2]. > > > > The votes are: > > *[+1 Allow]* Allow to deprecate the old APIs even when new APIs are marked > > with @IgniteExperimental to explicitly notify users that an old APIs will > > be removed in the next major release AND new APIs are available. > > *[-1 Prohibit]* Never deprecate the old APIs unless the new APIs are stable > > and released without @IgniteExperimental. The old APIs javadoc may be > > updated with a reference to new APIs to encourage users to evaluate new > > APIs. The deprecation and new API release may happen simultaneously if the > > new API is not marked with @IgniteExperimental or the annotation is removed > > in the same release. > > > > Neither of the choices prohibits deprecation of an API without a > > replacement if community decides so. > > > > The vote will hold for 72 hours and will end on February 13th 2020 08:00 > > UTC: > > > https://www.timeanddate.com/countdown/to?year=2020=2=13=8=0=0=utc-1 > > > > All votes count, there is no binding/non-binding status for this. > > > > [1] > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/lang/IgniteExperimental.java > > [2] > > > http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSS-Public-API-deprecation-rules-td45647.html > > > > Thanks, > > --AG
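To make the two options concrete, a hedged sketch of the API situation being voted on; the interface and method names are invented for illustration, only the annotations are real:

{code}
import org.apache.ignite.lang.IgniteExperimental;

interface CacheMetrics {
    /**
     * @deprecated Use {@link #histogram(String)} instead. Under [-1 Prohibit]
     * this javadoc pointer is allowed, but the {@code @Deprecated} annotation
     * below is not, as long as the replacement still carries
     * {@code @IgniteExperimental}.
     */
    @Deprecated
    long[] legacyHistogram(String name);

    /** Hypothetical replacement; experimental until the API stabilizes. */
    @IgniteExperimental
    long[] histogram(String name);
}
{code}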
[jira] [Created] (IGNITE-12646) When DEBUG mode is enabled GridToStringBuilder may throw java.util.ConcurrentModificationException
Sergey Chugunov created IGNITE-12646: Summary: When DEBUG mode is enabled GridToStringBuilder may throw java.util.ConcurrentModificationException Key: IGNITE-12646 URL: https://issues.apache.org/jira/browse/IGNITE-12646 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.9 With DEBUG enabled, many components like CommunicationSPI start to log much larger chunks of information; e.g. communication messages are logged as is. When a big enough message with a non-thread-safe collection inside is logged by the communication thread, it is possible that some other thread has started processing the same message. If that processing involves modifying the collection, the communication thread will get a ConcurrentModificationException in the middle of iterating over it. GridToStringBuilder should be safe from throwing this exception and (optionally) any type of RuntimeException. -- This message was sent by Atlassian Jira (v8.3.4#803005)
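A minimal sketch of the kind of guard the ticket asks for; the class and method names are illustrative, and the real fix would live inside GridToStringBuilder itself:

{code}
import java.util.Collection;
import java.util.ConcurrentModificationException;

public class SafeToString {
    /** Renders a possibly concurrently modified collection without propagating CME. */
    public static String collectionToString(Collection<?> col) {
        try {
            StringBuilder sb = new StringBuilder("[");

            for (Object o : col) {
                if (sb.length() > 1)
                    sb.append(", ");

                sb.append(o);
            }

            return sb.append(']').toString();
        }
        catch (ConcurrentModificationException ignored) {
            // The collection changed under our feet while being logged; degrade
            // gracefully instead of killing the communication thread.
            return "[failed to render: concurrent modification]";
        }
    }
}
{code}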
Re: AWS EBS Discovery: Contributor Wanted
Denis, Emmanouil, Sure, I'll take a look at the code shortly. -- Thank you, Sergey. On Mon, Jan 27, 2020 at 8:59 PM Denis Magda wrote: > I support the idea of triggering such tests on demand. We can create a wiki > page with instructions on how to run the tests. Unless there is a more > elegant solution. > > Sergey, would you be able to review Emmanouil's changes in the IP Finder > source code? > https://issues.apache.org/jira/browse/IGNITE-8617 > > - > Denis > > > On Sun, Jan 26, 2020 at 2:22 AM Emmanouil Gkatziouras < > gkatzio...@gmail.com> > wrote: > > > Hi all! > > > > I do believe being able to execute some AWS integration tests on demand > > would be of value, especially for reviewers who cannot have an AWS stack > > created on demand. > > More than happy to help on that. > > > > Kind regards > > *Emmanouil Gkatziouras* > > https://egkatzioura.com/ | > > https://www.linkedin.com/in/gkatziourasemmanouil/ > > https://github.com/gkatzioura > > > > > > On Fri, 24 Jan 2020 at 15:15, Sergey Chugunov > > > wrote: > > > > > Hello Emmanouil, > > > > > > It would be great if we have at least basic integration tests in real > AWS > > > environment. Even though they may require more work to keep them green > (I > > > mean here AWS quotas and additional configuration/reconfiguration > > efforts) > > > it worth it because these tests can also be useful as an examples. > > > > > > As the same time as IpFinder is such a basic component I don't think we > > > need to include them in constantly triggered suites like Run All but to > > > trigger them manually before/right after merging them to master branch > or > > > when developing something in related code. > > > > > > What do you think? > > > > > > -- > > > Thank you, > > > Sergey Chugunov. > > > > > > On Thu, Jan 23, 2020 at 5:32 PM Emmanouil Gkatziouras < > > > gkatzio...@gmail.com> > > > wrote: > > > > > > > Hi all! > > > > > > > > Yes It seems possible to get some free quota for integration tests on > > AWS > > > > [1] however most probably they are not gonna last forever. > > > > > > > > [1] > > > > > > > > > > > > > > https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/ > > > > > > > > King Regards > > > > *Emmanouil Gkatziouras* > > > > https://egkatzioura.com/ | > > > > https://www.linkedin.com/in/gkatziourasemmanouil/ > > > > https://github.com/gkatzioura > > > > > > > > > > > > On Wed, 22 Jan 2020 at 16:48, Denis Magda wrote: > > > > > > > > > Hi Emmanouil, > > > > > > > > > > Thanks for preparing a pull-request for Application Load Balancer: > > > > > https://issues.apache.org/jira/browse/IGNITE-8617 > > > > > > > > > > Igniters, who is willing to step in as a primary reviewer? > > > > > > > > > > As for automated testing on AWS, are you aware of any sponsorship > > > program > > > > > of AWS for open source projects of our kind? It will be ideal to > have > > > > real > > > > > testing on AWS but someone needs to pay. > > > > > > > > > > - > > > > > Denis > > > > > > > > > > > > > > > On Sun, Jan 19, 2020 at 6:45 AM Emmanouil Gkatziouras < > > > > > gkatzio...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi all! > > > > > > > > > > > > I have spinned up an Application Load Balancer and an autoscaling > > > group > > > > > on > > > > > > AWS and the Ignite discovery using TcpDiscoveryAlbIpFinder works > as > > > > > > expected. > > > > > > > > > > > >- On startup nodes discover each other. > > > > > >- On ec2 node down, connection is lost and the cluster > > decreases. 
> > > > > >- On an extra node addition the cluster size increases > > > > > > > > > > > > This contribution is essential since the Previous ELB based > > discovery > > > > > uses > > > > > > the Classic Load Balancer which is still available h
Re: AWS EBS Discovery: Contributor Wanted
Hello Emmanouil, It would be great if we had at least basic integration tests in a real AWS environment. Even though they may require more work to keep them green (I mean AWS quotas and additional configuration/reconfiguration efforts), it is worth it because these tests can also be useful as examples. At the same time, as IpFinder is such a basic component, I don't think we need to include these tests in constantly triggered suites like Run All; instead, we can trigger them manually before/right after merging them to the master branch or when developing something in related code. What do you think? -- Thank you, Sergey Chugunov. On Thu, Jan 23, 2020 at 5:32 PM Emmanouil Gkatziouras wrote: > Hi all! > > Yes It seems possible to get some free quota for integration tests on AWS > [1] however most probably they are not gonna last forever. > > [1] > > https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/ > > King Regards > *Emmanouil Gkatziouras* > https://egkatzioura.com/ | > https://www.linkedin.com/in/gkatziourasemmanouil/ > https://github.com/gkatzioura > > > On Wed, 22 Jan 2020 at 16:48, Denis Magda wrote: > > > Hi Emmanouil, > > > > Thanks for preparing a pull-request for Application Load Balancer: > > https://issues.apache.org/jira/browse/IGNITE-8617 > > > > Igniters, who is willing to step in as a primary reviewer? > > > > As for automated testing on AWS, are you aware of any sponsorship program > > of AWS for open source projects of our kind? It will be ideal to have > real > > testing on AWS but someone needs to pay. > > > > - > > Denis > > > > > > On Sun, Jan 19, 2020 at 6:45 AM Emmanouil Gkatziouras < > > gkatzio...@gmail.com> > > wrote: > > > > > Hi all! > > > > > > I have spinned up an Application Load Balancer and an autoscaling group > > on > > > AWS and the Ignite discovery using TcpDiscoveryAlbIpFinder works as > > > expected. > > > > > >- On startup nodes discover each other. > > >- On ec2 node down, connection is lost and the cluster decreases. > > >- On an extra node addition the cluster size increases > > > > > > This contribution is essential since the Previous ELB based discovery > > uses > > > the Classic Load Balancer which is still available however > > > AWS advices users to use the Application one. [1] > > > While my pull request gets reviewed I will also have a look at > > > the IGNITE-12398 [2] issue which has to do with the S3 discovery. > > > Another idea would also be to implement a `TCP Load Balancer based` > > > discovery. > > > > > > In order to test this issue and future ones I implemented some terraform > > > scripts (which I shall use for other issues too) [3]. > > > If some automated e2e testing on AWS is being considered they might be of > > > value. > > > I can help on implementing those tests by provisioning the infrastructure > > > in an automated way and validate the discovery.
> > > > > > [1] > > > > > > > > > https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/migrate-to-application-load-balancer.html > > > [2] https://issues.apache.org/jira/browse/IGNITE-12398 > > > [3] https://github.com/gkatzioura/ignite-aws-deploy > > > > > > Kind regards, > > > *Emmanouil Gkatziouras* > > > https://egkatzioura.com/ | > > > https://www.linkedin.com/in/gkatziourasemmanouil/ > > > https://github.com/gkatzioura > > > > > > > > > On Tue, 14 Jan 2020 at 22:22, Denis Magda wrote: > > > > > > > Hi Emmanouil, > > > > > > > > Agree, let's check that the IP finder functions normally in the cloud > > > > environment and the mock tests can be used for regular testing on > Team > > > > City. That's the way we tested other environment-specific IP finders > > > > including the Kubernetes one. > > > > > > > > Let us know once the IP finder is tested on AWS and then we can > proceed > > > > with the review. > > > > > > > > - > > > > Denis > > > > > > > > > > > > On Tue, Jan 14, 2020 at 2:47 AM Emmanouil Gkatziouras < > > > > gkatzio...@gmail.com> > > > > wrote: > > > > > > > > > Hi all! > > > > > > > > > > With regards to the `Node Discovery Using AWS Application ELB` > issue > > > [1] > >
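For readers landing on this thread: a hedged sketch of how the ALB-based IP finder from IGNITE-8617 might be wired into a node configuration. The TcpDiscoveryAlbIpFinder class name comes from the thread itself; its package and setter names are assumptions based on the PR discussion, not a published API:

{code}
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.elb.TcpDiscoveryAlbIpFinder; // assumed package

public class AlbDiscoveryExample {
    public static void main(String[] args) {
        TcpDiscoveryAlbIpFinder ipFinder = new TcpDiscoveryAlbIpFinder();
        ipFinder.setRegion("us-east-1");                           // assumed setter
        ipFinder.setTargetGrpARN("arn:aws:elasticloadbalancing:..."); // assumed setter; elided ARN

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        // Nodes registered behind the same ALB target group discover each other.
        Ignition.start(cfg);
    }
}
{code}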
New blog post on Apache Ignite in AWS
Hello community, Recently I published a new blog post on getting started with Apache Ignite in AWS [1]. I tried to make my example as simple as possible while keeping it usable. Let me know if this post is useful for you. I plan to write several follow-up posts about AWS-specific things but, depending on feedback, I may cover other topics in more detail. Any feedback is welcome, thank you! [1] https://www.gridgain.com/node/6247
[jira] [Created] (IGNITE-12439) More descriptive message in situation of IgniteOutOfMemoryException, warning message if risk of IOOME is found
Sergey Chugunov created IGNITE-12439: Summary: More descriptive message in situation of IgniteOutOfMemoryException, warning message if risk of IOOME is found Key: IGNITE-12439 URL: https://issues.apache.org/jira/browse/IGNITE-12439 Project: Ignite Issue Type: Improvement Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.9 In persistent mode, starting many caches in a data region of a small size may lead to IgniteOutOfMemoryException being thrown. The root cause is that each partition requires allocation of one or more metapages that must be stored during checkpoint and cannot be replaced by other types of pages. As a result, when too many metapages occupy a significant portion of the data region's space, a request to replace a page in memory (with one on disk) may not be able to find a clean page for replacement. In this situation IgniteOutOfMemoryException is thrown. It is not easy to prevent IOOME in the general case, but we should provide a more descriptive message when the exception is thrown and/or print out a warning to the logs when too many caches (or one cache with a huge number of partitions) are started in the same data region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
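Back-of-envelope arithmetic behind the warning this ticket proposes; the per-partition metapage count is an assumption for illustration ("one or more" per the description):

{code}
public class MetapageEstimate {
    public static void main(String[] args) {
        long caches = 100;
        long partitionsPerCache = 1024;
        long metaPagesPerPartition = 2; // assumed; "one or more" per partition
        long pageSize = 4096;           // default page size in bytes

        long metaBytes = caches * partitionsPerCache * metaPagesPerPartition * pageSize;

        // 100 * 1024 * 2 * 4096 = 838,860,800 bytes = 800 MiB of pages that can
        // never be replaced - far beyond a small data region, hence the IOOME.
        System.out.printf("Non-replaceable metapages: %d MiB%n", metaBytes / (1024 * 1024));
    }
}
{code}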
Re: IgniteOutOfMemoryException in LOCAL cache mode with persistence enabled
Hi Mitchell, I believe that the research done by Anton is correct, and the root cause of the OOME is the proportion of memory occupied by metapages in the data region. Each cache started in a data region allocates one or more metapages per initialized partition, so when you run your test with only one cache this is not a problem, but when a second cache is added it results in OOME. I don't think there is an easy way to prevent this exception in general, but I agree that we need to provide a more descriptive error message and/or an early warning for the user that the configuration of caches and data regions may lead to such an exception. I'll file a ticket for this improvement soon. Best regards, Sergey On Thu, Dec 12, 2019 at 1:27 AM Denis Magda wrote: > I tend to agree with Mitchell that the cluster should not crash. If the > crash is unavoidable based on the current architecture then a message > should be more descriptive. > > Ignite persistence experts, could you please join the conversation and > shed more light to the reported behavior? > > - > Denis > > > On Wed, Dec 11, 2019 at 3:25 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) < > mrathb...@bloomberg.net> wrote: > >> 2 GB is not reasonable for off heap memory for our use case. In general, >> even if off-heap is very low, performance should just degrade and calls >> should become blocking, I don't think that we should crash. Either way, the >> issue seems to be with putAll, not concurrent updates of different caches >> in the same data region. If I use Ignite's DataStreamer API instead of >> putAll, I get much better performance and no OOM exception. Any insight >> into why this might be would be appreciated. >> >> From: u...@ignite.apache.org At: 12/10/19 11:24:35 >> To: Mitchell Rathbun (BLOOMBERG/ 731 LEX ) , >> u...@ignite.apache.org >> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with >> persistence enabled >> >> Hello! >> >> 10M is very very low-ball for testing performance of disk, considering >> how Ignite's wal/checkpoints are structured. As already told, it does not >> even work properly. >> >> I recommend using 2G value instead. Just load enough data so that you can >> observe constant checkpoints. >> >> Regards, >> -- >> Ilya Kasnacheev >> >> >> ср, 4 дек. 2019 г. в 03:16, Mitchell Rathbun (BLOOMBERG/ 731 LEX) < >> mrathb...@bloomberg.net>: >> >>> For the requested full ignite log, where would this be found if we are >>> running using local mode? We are not explicitly running a separate ignite >>> node, and our WorkDirectory does not seem to have any logs >>> >>> From: u...@ignite.apache.org At: 12/03/19 19:00:18 >>> To: u...@ignite.apache.org >>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with >>> persistence enabled >>> >>> For our configuration properties, our DataRegion initialSize and MaxSize >>> was set to 11 MB and persistence was enabled. For DataStorage, our pageSize >>> was set to 8192 instead of 4096.
For Cache, write behind is disabled, on >>> heap cache is disabled, and Atomicity Mode is Atomic >>> >>> From: u...@ignite.apache.org At: 12/03/19 13:40:32 >>> To: u...@ignite.apache.org >>> Subject: Re: IgniteOutOfMemoryException in LOCAL cache mode with >>> persistence enabled >>> >>> Hi Mitchell, >>> >>> Looks like it could be easily reproduced on low off-heap sizes, I tried >>> with >>> simple puts and got the same exception: >>> >>> class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Failed >>> to >>> find a page for eviction [segmentCapacity=1580, loaded=619, >>> maxDirtyPages=465, dirtyPages=619, cpPages=0, pinnedInSegment=0, >>> failedToPrepare=620] >>> Out of memory in data region [name=Default_Region, initSize=10.0 MiB, >>> maxSize=10.0 MiB, persistenceEnabled=true] Try the following: >>> ^-- Increase maximum off-heap memory size >>> (DataRegionConfiguration.maxSize) >>> ^-- Enable Ignite persistence >>> (DataRegionConfiguration.persistenceEnabled) >>> ^-- Enable eviction or expiration policies >>> >>> It looks like Ignite must issue a proper warning in this case and couple >>> of >>> issues must be filed against Ignite JIRA. >>> >>> Check out this article on persistent store available in Ignite >>> confluence as >>> well: >>> >>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+und >>> er+the+hood#IgnitePersistentStore-underthehood-Checkpointing >>> >>> I've managed to make kind of similar example working with 20 Mb region >>> with >>> a bit of tuning, added following properties to >>> org.apache.ignite.configuration.DataStorageConfiguration: >>> / >>> / >>> >>> The whole idea behind this is to trigger checkpoint on timeout rather >>> than >>> on too much dirty pages percentage threshold. The checkpoint page buffer >>> size may not exceed data region size, which is 10 Mb, which might be >>> overflown during checkpoint as well. >>> >>> I assume that checkpoint is never triggered in this case because of >>> per-partition overhead: Ignite writes some meta per partition and it
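The two XML properties in the quoted message above were stripped by the archive (only bare "/" remnants survive). Judging from the surrounding text, they tuned checkpointing to trigger on a timer rather than on the dirty-pages threshold. A hedged Java reconstruction with illustrative values, using the real DataStorageConfiguration API:

{code}
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;

public class SmallRegionTuning {
    public static DataStorageConfiguration storageConfig() {
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("Default_Region")
            .setInitialSize(20L * 1024 * 1024)   // 20 MiB, as in the quoted experiment
            .setMaxSize(20L * 1024 * 1024)
            .setPersistenceEnabled(true)
            // Keep the checkpoint buffer within the tiny region.
            .setCheckpointPageBufferSize(10L * 1024 * 1024);

        return new DataStorageConfiguration()
            .setPageSize(8192)                   // page size mentioned in the thread
            // Fire checkpoints on a short timer so the dirty-pages threshold
            // is never the trigger (illustrative value).
            .setCheckpointFrequency(3_000)
            .setDefaultDataRegionConfiguration(region);
    }
}
{code}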
Contribution to Apache Ignite
Hello Lev, My name is Sergey, I'm from the Apache Ignite community. As I can see, you have successfully completed your first ticket, but there are some review comments on your second one [1]. Do you need any assistance with resolving them? Also, if you're interested in more challenging tasks, there are plenty of them, and we could figure out what to pick up next. Anyway, thank you for your interest in our project and community! [1] https://issues.apache.org/jira/browse/IGNITE-11312 -- Best regards, Sergey Chugunov.
Re: Binary object format KB article
Then I would suggest defining good terminology at the very beginning of the article. Right in the introduction section I see a lot of terms like "Binary object format", "Binary object container format" (is it the same thing?), "Binary serialization format". In the next section "binary type" pops up. What are the relations between them? The schemes part needs more examples. What is a scheme? How is it related to a binary type? Is it a one-to-one relationship? One-to-many? When is a new scheme created? Why should a type and a scheme be registered on the receiver side? And if there is a receiver, then who is the sender? It seems to me that the document tries to focus on details of the format itself, but other aspects of this functionality leak into the explanation and confuse the reader. On Wed, Oct 16, 2019 at 2:52 PM Ivan Pavlukhin wrote: > Pavel, Sergey, > > Thank you for your feedback! > > To be exact the document does not describe broad picture (including > metadata exchange) and is not a formal format specification > intentionally. I wanted to create a lightweight article giving an > intuition about binary object structure to a reader. And yes, > intuition about metadata registration is definitely an important, > related but slightly different subject. > > ср, 16 окт. 2019 г. в 14:23, Sergey Chugunov : > > > > Ivan, thank you for documenting this functionality, agree with Pavel > here. > > > > I think this document is a good starting point and contains a lot of > > low-level details and great examples but from my perspective it doesn't > > show how binary objects fit into a broader picture. > > > > It worth adding higher-level details and restructure the document into a > > top-down article starting from where binary format is used (representation > > of objects in cache, binary protocol for communications with thin clients) > > and down to lower details like binary metadata exchange and serialization > > and container formats. > > > > Another option would be to leave the document focused on a low-level > > details as it is now but build around it drafts for documents describing > > other aspects of Binary Objects. > > This will make our documentation much more solid and useful for readers. > > > > On Wed, Oct 16, 2019 at 2:07 PM Pavel Tupitsyn > wrote: > > > Ivan, great job, thanks for putting this together. > > > > > > I think we also need a more formal description of the format, including > > > binary metadata exchange mechanics. > > > It was done (partially) for IEP-9 Thin Client Protocol, we should probably > > > copy from there: > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-9+Thin+Client+Protocol#IEP-9ThinClientProtocol-BinaryObjects > > > > > > > > > > > > On Wed, Oct 16, 2019 at 11:49 AM Ivan Pavlukhin > > > wrote: > > > > > > > Igniters, > > > > > > > > I published a document about Binary format in cwiki [1]. Please share > > > > your feedback. I feel that there is a lack of pictures on the page. > > > > Need to figure out what aspects will be more clear with pictures. > > > > > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/IGNITE/Binary+object+format > > > > > > > > -- > > > > Best regards, > > > > Ivan Pavlukhin > > > > > > > > > > -- > Best regards, > Ivan Pavlukhin >
Re: Binary object format KB article
Ivan, thank you for documenting this functionality, agree with Pavel here. I think this document is a good starting point and contains a lot of low-level details and great examples, but from my perspective it doesn't show how binary objects fit into a broader picture. It is worth adding higher-level details and restructuring the document into a top-down article, starting from where the binary format is used (representation of objects in cache, the binary protocol for communication with thin clients) and going down to lower-level details like binary metadata exchange and the serialization and container formats. Another option would be to leave the document focused on low-level details as it is now, but build around it drafts for documents describing other aspects of Binary Objects. This will make our documentation much more solid and useful for readers. On Wed, Oct 16, 2019 at 2:07 PM Pavel Tupitsyn wrote: > Ivan, great job, thanks for putting this together. > > I think we also need a more formal description of the format, including > binary metadata exchange mechanics. > It was done (partially) for IEP-9 Thin Client Protocol, we should probably > copy from there: > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-9+Thin+Client+Protocol#IEP-9ThinClientProtocol-BinaryObjects > > > > On Wed, Oct 16, 2019 at 11:49 AM Ivan Pavlukhin > wrote: > > > Igniters, > > > > I published a document about Binary format in cwiki [1]. Please share > > your feedback. I feel that there is a lack of pictures on the page. > > Need to figure out what aspects will be more clear with pictures. > > > > [1] > > https://cwiki.apache.org/confluence/display/IGNITE/Binary+object+format > > > > -- > > Best regards, > > Ivan Pavlukhin > >
Cluster ID and tag: identification of the cluster
Hello folks, I would like to propose implementing new properties to identify the cluster and simplify managing it with external tools (e.g. custom scripts built on top of the standard Control Utility). These properties are Cluster ID (UUID) and Cluster tag (String), exposed through the existing IgniteCluster public API. Both properties are generated upon cluster startup (before activation if the cluster requires it); they survive restarts if the cluster is configured in persistent mode and are regenerated if it is an in-memory cluster. Cluster ID is immutable and more useful for automated tools, while Cluster tag is mutable (it may be changed by the user) and is supposed to be human-readable for viewing in GUI or web-based management solutions. I have already created a ticket [1] with some more technical details and invite the community to discuss this feature. [1] https://issues.apache.org/jira/browse/IGNITE-12111 -- Best Regards, Sergey Chugunov.
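From user code the proposal could look like the sketch below; this reflects the API shape proposed in the ticket and is subject to change during review:

{code}
import java.util.UUID;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ClusterIdTagExample {
    public static void main(String[] args) throws Exception {
        Ignite ignite = Ignition.start();

        UUID clusterId = ignite.cluster().id(); // immutable, for automated tools
        String tag = ignite.cluster().tag();    // human-readable, mutable

        System.out.println("id=" + clusterId + ", tag=" + tag);

        // An operator renames the cluster for dashboards and scripts:
        ignite.cluster().tag("us-east-production");
    }
}
{code}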
[jira] [Created] (IGNITE-12111) Cluster ID and tag: properties to identify the cluster
Sergey Chugunov created IGNITE-12111: Summary: Cluster ID and tag: properties to identify the cluster Key: IGNITE-12111 URL: https://issues.apache.org/jira/browse/IGNITE-12111 Project: Ignite Issue Type: New Feature Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 To improve cluster management capabilities, two new properties of the cluster are introduced: # A unique cluster ID (may be either a random UUID or a random IgniteUUID). Generated upon cluster start and saved to the distributed metastorage. Immutable. Persistent clusters must persist the value. In-memory clusters should keep the generated ID in memory and regenerate it upon restart. # A human-readable cluster tag. Generated by default as some random (but still meaningful) string; may be changed by the user. Again, it survives restarts in persistent clusters and is regenerated in in-memory clusters on every restart. These properties are exposed through standard APIs: # an EVT_CLUSTER_TAG_CHANGED event generated when the tag is changed by the user; # JMX bean and control utility APIs to view the ID and tag and to change the tag. -- This message was sent by Atlassian Jira (v8.3.2#803003)
Re: Asynchronous registration of binary metadata
ested. > We > > do > > > > not > > > > > need another copy/paste code. > > > > > > > > > > Another possibility is to carry metadata along with appropriate > > request > > > > if > > > > > it's not found locally but this is a rather big modification. > > > > > > > > > > > > > > > > > > > > вт, 20 авг. 2019 г. в 17:26, Denis Mekhanikov < > dmekhani...@gmail.com > > >: > > > > > > > > > > > Eduard, > > > > > > > > > > > > Usages will wait for the metadata to be registered and written to > > disk. > > > > No > > > > > > races should occur with such flow. > > > > > > Or do you have some specific case on your mind? > > > > > > > > > > > > I agree, that using a distributed meta storage would be nice > here. > > > > > > But this way we will kind of move to the previous scheme with a > > > > replicated > > > > > > system cache, where metadata was stored before. > > > > > > Will scheme with the metastorage be different in any way? Won’t > we > > > > decide > > > > > > to move back to discovery messages again after a while? > > > > > > > > > > > > Denis > > > > > > > > > > > > > > > > > > > On 20 Aug 2019, at 15:13, Eduard Shangareev < > > > > eduard.shangar...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > Denis, > > > > > > > How would we deal with races between registration and metadata > > usages > > > > > > with > > > > > > > such fast-fix? > > > > > > > > > > > > > > I believe, that we need to move it to distributed metastorage, > > and > > > > await > > > > > > > registration completeness if we can't find it (wait for work in > > > > > > progress). > > > > > > > Discovery shouldn't wait for anything here. > > > > > > > > > > > > > > On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov < > > > > dmekhani...@gmail.com > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Sergey, > > > > > > > > > > > > > > > > Currently metadata is written to disk sequentially on every > > node. Only > > > > > > one > > > > > > > > node at a time is able to write metadata to its storage. > > > > > > > > Slowness accumulates when you add more nodes. A delay > required > > to > > > > write > > > > > > > > one piece of metadata may be not that big, but if you > multiply > > it by > > > > say > > > > > > > > 200, then it becomes noticeable. > > > > > > > > But If we move the writing out from discovery threads, then > > nodes will > > > > > > be > > > > > > > > doing it in parallel. > > > > > > > > > > > > > > > > I think, it’s better to block some threads from a striped > pool > > for a > > > > > > > > little while rather than blocking discovery for the same > > period, but > > > > > > > > multiplied by a number of nodes. > > > > > > > > > > > > > > > > What do you think? > > > > > > > > > > > > > > > > Denis > > > > > > > > > > > > > > > > > On 15 Aug 2019, at 10:26, Sergey Chugunov < > > sergey.chugu...@gmail.com > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Denis, > > > > > > > > > > > > > > > > > > Thanks for bringing this issue up, decision to write binary > > metadata > > > > > > from > > > > > > > > > discovery thread was really a tough decision to make. > > > > > > > > > I don't think that moving metadata to metastorage is a > > silver bullet > > > > > > here > > > > >
Re: Re[2]: Asynchronous registration of binary metadata
Denis, Thanks for bringing this issue up; the decision to write binary metadata from the discovery thread was really a tough one to make. I don't think that moving metadata to the metastorage is a silver bullet here, as this approach also has its drawbacks and is not an easy change. In addition to the workarounds suggested by Alexei, we have two choices to offload the write operation from the discovery thread: 1. Your scheme with a separate writer thread and futures completed when the write operation is finished. 2. A PME-like protocol with obvious complications like failover and asynchronous waiting for replies over the communication layer. Your suggestion looks easier from a code-complexity perspective, but in my view it increases the chances of running into starvation. Right now, if some node faces really long delays during a write op, it is going to be kicked out of the topology by the discovery protocol. In your case it is possible that more and more threads from other pools get stuck waiting on the operation future, which is also not good. What do you think? I also think that if we want to approach this issue systematically, we need to do a deep analysis of the metastorage option as well and finally choose which road we want to go. Thanks! On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky wrote: > > > > >> 1. Yes, only on OS failures. In such case data will be received from alive > >> nodes later. > What behavior would be in case of one node ? I suppose someone can obtain > cache data without unmarshalling schema, what in this case would be with > grid operability? > > > > >> 2. Yes, for walmode=FSYNC writes to metastore will be slow. But such > mode > >> should not be used if you have more than two nodes in grid because it > has > >> huge impact on performance. > Is wal mode affects metadata store ? > > > > >> > >> ср, 14 авг. 2019 г. в 14:29, Denis Mekhanikov < dmekhani...@gmail.com > >: > >> > >>> Folks, > >>> > >>> Thanks for showing interest in this issue! > >>> > >>> Alexey, > >>> > I think removing fsync could help to mitigate performance issues with > >>> current implementation > >>> > >>> Is my understanding correct, that if we remove fsync, then discovery won’t > >>> be blocked, and data will be flushed to disk in background, and loss of > >>> information will be possible only on OS failure? It sounds like an > >>> acceptable workaround to me. > >>> > >>> Will moving metadata to metastore actually resolve this issue? Please > >>> correct me if I’m wrong, but we will still need to write the information to > >>> WAL before releasing the discovery thread. If WAL mode is FSYNC, then the > >>> issue will still be there. Or is it planned to abandon the discovery-based > >>> protocol at all? > >>> > >>> Evgeniy, Ivan, > >>> > >>> In my particular case the data wasn’t too big. It was a slow virtualised > >>> disk with encryption, that made operations slow. Given that there are 200 > >>> nodes in a cluster, where every node writes slowly, and this process is > >>> sequential, one piece of metadata is registered extremely slowly. > >>> > >>> Ivan, answering to your other questions: > >>> > 2. Do we need a persistent metadata for in-memory caches? Or is it so > >>> accidentally? > >>> > >>> It should be checked, if it’s safe to stop writing marshaller mappings to > >>> disk without loosing any guarantees. > >>> But anyway, I would like to have a property, that would control this. If > >>> metadata registration is slow, then initial cluster warmup may take a > >>> while.
So, if we preserve metadata on disk, then we will need to warm > it up > >>> only once, and further restarts won’t be affected. > >>> > Do we really need a fast fix here? > >>> > >>> I would like a fix, that could be implemented now, since the activity > with > >>> moving metadata to metastore doesn’t sound like a quick one. Having a > >>> temporary solution would be nice. > >>> > >>> Denis > >>> > On 14 Aug 2019, at 11:53, Павлухин Иван < vololo...@gmail.com > > wrote: > > Denis, > > Several clarifying questions: > 1. Do you have an idea why metadata registration takes so long? So > poor disks? So many data to write? A contention with disk writes by > other subsystems? > 2. Do we need a persistent metadata for in-memory caches? Or is it so > accidentally? > > Generally, I think that it is possible to move metadata saving > operations out of discovery thread without loosing required > consistency/integrity. > > As Alex mentioned using metastore looks like a better solution. Do we > really need a fast fix here? (Are we talking about fast fix?) > > ср, 14 авг. 2019 г. в 11:45, Zhenya Stanilovsky > >>> < arzamas...@mail.ru.invalid >: > > > > Alexey, but in this case customer need to be informed, that whole > (for > >>> example 1 node) cluster crash (power off) could lead to partial data > >>> unavailability. > > And may be further index corruption.
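A sketch of option 1 from the reply above: a dedicated writer thread plus per-type futures that usages wait on. All names are illustrative, and versioning of metadata updates is omitted for brevity:

{code}
import java.util.Map;
import java.util.concurrent.*;

public class AsyncMetadataWriter {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Map<Integer, CompletableFuture<Void>> futs = new ConcurrentHashMap<>();

    /** Called from the discovery thread: schedules the disk write, returns immediately. */
    public void onMetadataUpdate(int typeId, byte[] marshalled) {
        CompletableFuture<Void> fut = futs.computeIfAbsent(typeId, id -> new CompletableFuture<>());

        writer.submit(() -> {
            writeToDisk(typeId, marshalled); // the actual fsync'ed write
            fut.complete(null);
        });
    }

    /** Called from striped-pool threads before using metadata: blocks until it is durable. */
    public void awaitWritten(int typeId) throws Exception {
        CompletableFuture<Void> fut = futs.get(typeId);

        if (fut != null)
            fut.get(10, TimeUnit.SECONDS); // bounded wait to surface the starvation risk discussed above
    }

    private void writeToDisk(int typeId, byte[] data) {
        // Omitted: append to the metadata store and fsync.
    }
}
{code}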
[jira] [Created] (IGNITE-11952) Bug fixes and improvements in console utilities & test fixes
Sergey Chugunov created IGNITE-11952: Summary: Bug fixes and improvements in console utilities & test fixes Key: IGNITE-11952 URL: https://issues.apache.org/jira/browse/IGNITE-11952 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11865) FailureProcessor treats tcp-comm-worker as blocked when it works on reestablishing connect to failed client node
Sergey Chugunov created IGNITE-11865: Summary: FailureProcessor treats tcp-comm-worker as blocked when it works on reestablishing connect to failed client node Key: IGNITE-11865 URL: https://issues.apache.org/jira/browse/IGNITE-11865 Project: Ignite Issue Type: Bug Affects Versions: 2.7 Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 When a client node fails, the tcp-comm-worker thread on the server keeps trying to reestablish a connection to the client until the failed node is removed from the topology (on expiration of clientFailureDetectionTimeout). As the tcp-comm-worker thread doesn't update its heartbeats from internal loops, FailureProcessor considers it blocked and prints out a misleading message to the logs along with a full thread dump. To avoid polluting the logs with unnecessary messages, we need to teach tcp-comm-worker to update its heartbeat timestamp in FailureProcessor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
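The gist of the proposed fix as a self-contained sketch; the class is illustrative, while updateHeartbeat() mirrors the GridWorker-style heartbeat hook the ticket refers to:

{code}
public abstract class ReconnectWorker {
    private volatile long heartbeatTs = System.currentTimeMillis();

    /** A FailureProcessor-style watchdog reads this timestamp. */
    public long heartbeatTs() { return heartbeatTs; }

    protected void updateHeartbeat() { heartbeatTs = System.currentTimeMillis(); }

    protected void reconnectLoop() throws InterruptedException {
        while (!reconnect()) {
            updateHeartbeat(); // we are alive, just busy retrying a dead client

            Thread.sleep(500); // back off until clientFailureDetectionTimeout removes the node
        }
    }

    /** Attempts one connection; returns false if the client is still unreachable. */
    protected abstract boolean reconnect();
}
{code}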
[jira] [Created] (IGNITE-11743) Stopping caches concurrently with node join may lead to crash of the node
Sergey Chugunov created IGNITE-11743: Summary: Stopping caches concurrently with node join may lead to crash of the node Key: IGNITE-11743 URL: https://issues.apache.org/jira/browse/IGNITE-11743 Project: Ignite Issue Type: Bug Affects Versions: 2.7 Reporter: Sergey Chugunov Assignee: Sergey Chugunov Attachments: IgnitePdsNodeRestartCacheCreateTest.java When an existing cache is stopped (e.g. via a call to Ignite#destroyCache(String name)), this action is distributed across the cluster by the discovery mechanism (and is processed from the *disco-notifier-worker* thread). At the same time, a joining node prepares to start caches from the *exchange-thread*. If a cache stop request arrives at the new node right in the middle of cache start preparation, it may lead to an exception in FilePageStoreManager like the one below and a crash of the node. A test reproducing the issue is attached. {noformat} class org.apache.ignite.IgniteCheckedException: Failed to get page store for the given cache ID (cache has not been started): -1422502786 at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.getStore(FilePageStoreManager.java:1132) at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:482) at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:469) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:854) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:681) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.getOrAllocateCacheMetas(GridCacheOffheapManager.java:869) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.initDataStructures(GridCacheOffheapManager.java:128) at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.start(IgniteCacheOffheapManagerImpl.java:193) at org.apache.ignite.internal.processors.cache.CacheGroupContext.start(CacheGroupContext.java:1043) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2829) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2557) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:2387) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$null$6a5b31b9$1(GridCacheProcessor.java:2209) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$5(GridCacheProcessor.java:2130) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$926b6886$1(GridCacheProcessor.java:2206) at org.apache.ignite.internal.util.IgniteUtils.lambda$null$1(IgniteUtils.java:10874) at java.util.concurrent.FutureTask.run(FutureTask.java:266) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11739) Refactoring of cache lifecycle and cache configuration management code
Sergey Chugunov created IGNITE-11739: Summary: Refactoring of cache lifecycle and cache configuration management code Key: IGNITE-11739 URL: https://issues.apache.org/jira/browse/IGNITE-11739 Project: Ignite Issue Type: Task Components: cache Reporter: Sergey Chugunov h2. Problem Currently the code responsible for cache lifecycle and configuration management is spread across different entities (e.g. GridCacheProcessor, GridCacheAffinityManager, ClusterCachesInfo and so on). Cache configuration data is duplicated multiple times and presented in different forms, from StoredCacheData to CacheGroupDescriptor to DynamicCacheDescriptor to ClusterCachesInfo. Altogether there is no entity or abstraction that contains most of the logic of managing cache state and configuration and provides a clean API for this purpose. All this makes it hard to maintain the code, fix bugs and make improvements, so the need for refactoring and the benefits from it are obvious. h2. Approaches # Build a state machine manipulating immutable state objects to reflect transitions between states (see the sketch below). # Concentrate all cache-related info in a new (an abstraction like a cache container) or an existing (e.g. cache context) mutable entity and manipulate that entity to reflect the evolution of the cache. # Some mix of these two approaches. There are already plenty of entities like CacheInfo or CacheDescriptor with names suggesting they contain information about a cache. The problem, though, is that each of these entities manages only some part of the information. Regardless of which approach is used, a clear and well-documented API should be provided for managing lifecycle and configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
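A tiny illustration of approach #1 above: cache lifecycle as transitions over immutable state objects. Names and phases are invented for the sketch:

{code}
public final class CacheState {
    enum Phase { REGISTERED, PREPARED, STARTED, STOPPED }

    private final String name;
    private final Phase phase;

    private CacheState(String name, Phase phase) {
        this.name = name;
        this.phase = phase;
    }

    public static CacheState register(String name) {
        return new CacheState(name, Phase.REGISTERED);
    }

    /** Every transition returns a new object; state stays inspectable and thread-safe. */
    public CacheState to(Phase next) {
        if (next != Phase.STOPPED && next.ordinal() != phase.ordinal() + 1)
            throw new IllegalStateException("Illegal transition: " + phase + " -> " + next);

        return new CacheState(name, next);
    }

    public Phase phase() { return phase; }
}
{code}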
[jira] [Created] (IGNITE-11621) Node is stuck in "No next node in topology" infinite loop in special case.
Sergey Chugunov created IGNITE-11621: Summary: Node is stuck in "No next node in topology" infinite loop in special case. Key: IGNITE-11621 URL: https://issues.apache.org/jira/browse/IGNITE-11621 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Attachments: NoNextNodeInTopologyReproducer.java In a special case (reproducer is attached) a node may get stuck in a loop when the following sequence of events happens: * Nodes A and B are already in the cluster. * Node C starts joining the cluster. * On node C's NodeAdded message, new node D is started. * Before NodeAddFinished for node C reaches it, the socket to node C fails and the node is considered failed by the cluster. * When the NodeFailed message for node C reaches node B, both A and B fail. * After that node D gets stuck in an infinite "No next node in topology" loop, processing NodeFailed messages for A, B and C indefinitely. The main logic in the attached reproducer lives in node1SpecialSpi - it is the TcpDiscoverySpi that node B starts with. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11493) Test CheckpointFreeListTest#testFreeListRestoredCorrectly always fails in DiskCompression suite
Sergey Chugunov created IGNITE-11493: Summary: Test CheckpointFreeListTest#testFreeListRestoredCorrectly always fails in DiskCompression suite Key: IGNITE-11493 URL: https://issues.apache.org/jira/browse/IGNITE-11493 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov The test fails with the following NullPointerException in the logs: {code} [2019-03-06 16:05:24,353][ERROR][exchange-worker-#94%client%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteCheckedException: null]] class org.apache.ignite.IgniteCheckedException: null at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7323) at org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260) at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:209) at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:160) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2948) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2769) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.ignite.internal.processors.cache.CacheCompressionManager.start0(CacheCompressionManager.java:55) at org.apache.ignite.internal.processors.cache.GridCacheManagerAdapter.start(GridCacheManagerAdapter.java:50) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.initCacheContext(GridCacheProcessor.java:2534) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:2344) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:2270) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$55a0e703$1(GridCacheProcessor.java:2141) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$5(GridCacheProcessor.java:2094) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:2138) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:2093) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCachesOnLocalJoin(GridCacheProcessor.java:2039) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:951) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:810) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2920) ... 3 more {code} The root cause is that CacheManager, when initializing the CacheContext on a client, tries to start GridCompressionManager, which doesn't make sense on a client node. We should either exclude the compression manager from the cache context on the client or not start it during the initialization phase. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11459) Possible dead code in TcpDiscoveryStatusCheckMessage flow
Sergey Chugunov created IGNITE-11459: Summary: Possible dead code in TcpDiscoveryStatusCheckMessage flow Key: IGNITE-11459 URL: https://issues.apache.org/jira/browse/IGNITE-11459 Project: Ignite Issue Type: Improvement Reporter: Sergey Chugunov Working on IGNITE-11364 I found the following suspicious detail about the StatusCheck flow: in the message there is a special field {{failedNodeId}} which seems to duplicate the functionality of the {{failedNodes}} collection in TcpDiscoveryAbstractMessage. The {{failedNodeId}} field is filled only in a special scenario of a failed ping of a remote node. It is used *only* to ignore the message. A historical overview of this field revealed commit *838c0fd* where a meaningful piece of code was either intentionally removed or accidentally lost: {noformat} if (msg instanceof TcpDiscoveryStatusCheckMessage) { TcpDiscoveryStatusCheckMessage msg0 = (TcpDiscoveryStatusCheckMessage)msg; if (next.id().equals(msg0.failedNodeId())) { next = null; if (log.isDebugEnabled()) log.debug("Discarding status check since next node has indeed failed [next=" + next + ", msg=" + msg + ']'); // Discard status check message by exiting loop and handle failure. break; } } {noformat} Conclusion: the field {{failedNodeId}} and the whole flow around it look suspicious and have to be reviewed for flaws. The review should result in either a redesign of the flow or deletion of the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11348) Ping node procedure may fail when another node leaves the cluster
Sergey Chugunov created IGNITE-11348: Summary: Ping node procedure may fail when another node leaves the cluster Key: IGNITE-11348 URL: https://issues.apache.org/jira/browse/IGNITE-11348 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 Additional pinging of a node on join, implemented in IGNITE-5569, may incorrectly fail, leading to shutting down the joining node. The reason for this is that if another node from the same host, bound to the same discovery port as the joining node, has left the cluster right before the joining node, the socket used for pinging gets closed. This leads to a situation where the pinging node considers the joining node "unreachable" and fails it with the JOIN_IMPOSSIBLE error code. Workaround: simply restart the node that failed on join. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11290) History of server node IDs should be passed to new nodes with NodeAddedMessage
Sergey Chugunov created IGNITE-11290: Summary: History of server node IDs should be passed to new nodes with NodeAddedMessage Key: IGNITE-11290 URL: https://issues.apache.org/jira/browse/IGNITE-11290 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 As part of IGNITE-5569, a (bounded) history of IDs of all server nodes that have existed in the cluster is introduced to prevent join requests with duplicate IDs if a network glitch happens during a node's join process. The initial implementation maintains the history locally on each node and doesn't transfer it to successfully joined nodes. The history needs to be passed (in NodeAdded messages) to new nodes to cover edge-case scenarios of coordinator failover. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11159) Collections of 'start-on-join' caches and 'init-caches' should be filtered
Sergey Chugunov created IGNITE-11159: Summary: Collections of 'start-on-join' caches and 'init-caches' should be filtered Key: IGNITE-11159 URL: https://issues.apache.org/jira/browse/IGNITE-11159 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Baseline auto-adjust`s discuss
Anton, As I understand from the IEP document, the policy was supposed to support two timeouts, soft and hard, so here you're proposing a bit simpler functionality. Just to clarify, do I understand correctly that this feature, when enabled, will auto-adjust BLT on each node join/node left event, and the timeout is necessary to protect us from blinking nodes? So no complexities with taking into account the number of alive backups or anything like that? On Fri, Jan 25, 2019 at 1:11 PM Vladimir Ozerov wrote: > Got it, makes sense. > > On Fri, Jan 25, 2019 at 11:06 AM Anton Kalashnikov > wrote: > > > Vladimir, thanks for your notes, both of them looks good enough but I > > have two different thoughts about it. > > > > I think I agree about enabling only one of manual/auto adjustment. It is > > easier than current solution and in fact as extra feature we can allow > > user to force task to execute(if they doesn't want to wait until timeout > > expired). > > But about second one I don't sure that one parameters instead of two would > > be more convenient. For example: in case when user changed timeout and then > > disable auto-adjust after then when someone will want to enable it they > > should know what value of timeout was before auto-adjust was disabled. I > > think "negative value" pattern good choice for always usable parameters > > like timeout of connection (ex. -1 equal to endless waiting) and so on, but > > in our case we want to disable whole functionality rather than change > > parameter value. > > > > -- > > Best regards, > > Anton Kalashnikov > > > > > > 24.01.2019, 22:03, "Vladimir Ozerov" : > > > Hi Anton, > > > > > > This is great feature, but I am a bit confused about automatic disabling > > of > > > a feature during manual baseline adjustment. This may lead to unpleasant > > > situations when a user enabled auto-adjustment, then re-adjusted it > > > manually somehow (e.g. from some previously created script) so that > > > auto-adjustment disabling went unnoticed, then added more nodes hoping > > that > > > auto-baseline is still active, etc. > > > > > > Instead, I would rather make manual and auto adjustment mutually > > exclusive > > > - baseline cannot be adjusted manually when auto mode is set, and vice > > > versa. If exception is thrown in that cases, administrators will always > > > know current behavior of the system. > > > > > > As far as configuration, wouldn’t it be enough to have a single long > > value > > > as opposed to Boolean + long? Say, 0 - immediate auto adjustment, > > negative > > > - disabled, positive - auto adjustment after timeout. > > > > > > Thoughts? > > > > > > чт, 24 янв. 2019 г. в 18:33, Anton Kalashnikov : > > > > > >> Hello, Igniters! > > >> > > >> Work on the Phase II of IEP-4 (Baseline topology) [1] has started. I > > want > > >> to start to discuss of implementation of "Baseline auto-adjust" [2]. > > >> > > >> "Baseline auto-adjust" feature implements mechanism of auto-adjust > > >> baseline corresponding to current topology after event join/left was > > >> appeared. It is required because when a node left the grid and nobody would > > >> change baseline manually it can lead to lost data(when some more nodes left > > >> the grid on depends in backup factor) but permanent tracking of grid is not > > >> always possible/desirible. Looks like in many cases auto-adjust baseline > > >> after some timeout is very helpfull.
> > >> > > >> Distributed metastore[3](it is already done): > > >> > > >> First of all it is required the ability to store configuration data > > >> consistently and cluster-wide. Ignite doesn't have any specific API > for > > >> such configurations and we don't want to have many similar > > implementations > > >> of the same feature in our code. After some thoughts is was proposed > to > > >> implement it as some kind of distributed metastorage that gives the > > ability > > >> to store any data in it. > > >> First implementation is based on existing local metastorage API for > > >> persistent clusters (in-memory clusters will store data in memory). > > >> Write/remove operation use Discovery SPI to send updates to the > > cluster, it > > >> guarantees updates order and the fact that all existing (alive) nodes > > have > > >> handled the update message. As a way to find out which node has the > > latest > > >> data there is a "version" value of distributed metastorage, which is > > >> basically . All updates > history > > >> until some point in the past is stored along with the data, so when > an > > >> outdated node connects to the cluster it will receive all the missing > > data > > >> and apply it locally. If there's not enough history stored or joining > > node > > >> is clear then it'll receive shapshot of distributed metastorage so > > there > > >> won't be inconsistencies. > > >> > > >> Baseline auto-adjust: > > >> > > >> Main scenario: > > >> - There is grid with the baseline
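What the discussed feature looks like from the public API side; a sketch based on this thread, since the exact method names were still under discussion here:

{code}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class BaselineAutoAdjustExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Enable the feature and ask for BLT adjustment 30 seconds after the
        // last join/left event (method names follow the shape proposed here
        // and may differ in the final API).
        ignite.cluster().baselineAutoAdjustEnabled(true);
        ignite.cluster().baselineAutoAdjustTimeout(30_000);

        // Vladimir's single-parameter alternative would collapse both calls into
        // one value: 0 - immediate adjustment, negative - disabled, positive - timeout.
    }
}
{code}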
[jira] [Created] (IGNITE-11011) Initialize components with grid disco data when NodeAddFinished message is received
Sergey Chugunov created IGNITE-11011: Summary: Initialize components with grid disco data when NodeAddFinished message is received Key: IGNITE-11011 URL: https://issues.apache.org/jira/browse/IGNITE-11011 Project: Ignite Issue Type: Improvement Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 There is an issue when the CacheProcessor on a fresh coordinator (the very first node in a new topology) receives grid discovery data from another cluster that died before this node joined its topology but after sending the NodeAdded message. IGNITE-10878 fixes it by filtering cache descriptors and cache groups in GridCacheProcessor, which is not a generic solution. To fix the issue in a truly generic way, a node should initialize its components (including the cache processor) not on receiving the NodeAdded message but on the NodeAddFinished message. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-10819) Test IgniteClientRejoinTest.testClientsReconnectAfterStart became flaky in master recently
Sergey Chugunov created IGNITE-10819: Summary: Test IgniteClientRejoinTest.testClientsReconnectAfterStart became flaky in master recently Key: IGNITE-10819 URL: https://issues.apache.org/jira/browse/IGNITE-10819 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Fix For: 2.8 As the [test history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=-21180267941031641=testDetails_IgniteTests24Java8=%3Cdefault%3E] in the master branch shows, the test has become flaky recently. It started failing when IGNITE-10555 was merged to master. The reason for the failure is a timeout when the *client4* node hangs waiting for PME to complete. Communication failures are emulated in the test, and when all clients fail to init an exchange on a specific affinity topology version (major=7, minor=1) everything works fine. But sometimes the *client4* node manages to finish initializing the exchange and hangs forever. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-10809) IgniteClusterActivateDeactivateTestWithPersistence.testActivateFailover3 fails in master
Sergey Chugunov created IGNITE-10809: Summary: IgniteClusterActivateDeactivateTestWithPersistence.testActivateFailover3 fails in master Key: IGNITE-10809 URL: https://issues.apache.org/jira/browse/IGNITE-10809 Project: Ignite Issue Type: Bug Reporter: Sergey Chugunov Assignee: Sergey Chugunov Fix For: 2.8 The test logic involves independent activation of two sets of nodes and then joining them into a single cluster. After the BaselineTopology concept was introduced in version 2.4, this action became prohibited to enforce data integrity. The test should be refactored to take this into account. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] ignite pull request #5600: Ignite 10374
GitHub user sergey-chugunov-1985 opened a pull request: https://github.com/apache/ignite/pull/5600 Ignite 10374 You can merge this pull request into a Git repository by running: $ git pull https://github.com/gridgain/apache-ignite ignite-10374-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ignite/pull/5600.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5600 commit 52cf2809a759071d440719f968ab0d0040fdf23e Author: EdShangGG Date: 2018-07-17T15:04:38Z IGNITE-9013 Fail cache future when local node is stopping - Fixes #4369. (cherry picked from commit 85b2002796fb601d7e7ce7d7320943f9323c2bdd) commit 92dbb2197488b5c0c61182586d49254263b3a49b Author: Evgeny Stanilovskiy Date: 2018-08-10T14:47:13Z IGNITE-9231 improvement throttle implementation, unpark threads on cpBuf condition. - Fixes #4506. Signed-off-by: Ivan Rakov commit 0e66d270a41caeedee20a93a3ad4aea95f074dce Author: Dmitriy Govorukhin Date: 2018-08-13T08:59:27Z IGNITE-9244 Partition eviction should not take all threads in system pool (cherry picked from commit 2d63040) Signed-off-by: Dmitriy Govorukhin commit fcef6b826ef1eef13c14e752e52c1d05e49ac508 Author: Alexey Kukushkin Date: 2018-04-26T16:31:43Z IGNITE-8237 Ignite blocks on SecurityException in exchange-worker due to unauthorised on-heap cache configuration. - Fixes #3818. Signed-off-by: dpavlov (cherry picked from commit 54cb262438bc83af3c4e864a7e5897b36fcd8c73) commit 0a12c50b4755380ab1b25dceee7420d465a6bfec Author: dpavlov Date: 2018-04-26T16:38:05Z IGNITE-8237 Javadoc for method parameters added. (cherry picked from commit ebe55e3ff84232f67a2885354e3e26426a6b16cb) commit 35cc55405823482e51929d42b74c5e9030bb74e9 Author: Evgeny Stanilovskiy Date: 2018-08-10T13:21:25Z IGNITE-8724 Fixed misleading U.warn implementation - Fixes #4145. commit 3ce67e86bde0c12f954403c12b1b799661f5d5ae Author: Dmitriy Govorukhin Date: 2018-08-14T12:10:30Z Merge remote-tracking branch 'professional/ignite-2.5.1-master' into ignite-2.5.1-master commit 8ee5db8f253a0092a2200593fa28887878cd9a15 Author: Dmitriy Govorukhin Date: 2018-08-10T12:32:19Z IGNITE-9050 WAL iterator should throw an exception if segment tail is reached inside archive directory - Fixes #4429. Signed-off-by: Alexey Goncharuk (cherry picked from commit dbf5574) commit 88c1035404a7a5b778e74a45bd443200602f15bd Author: Dmitriy Govorukhin Date: 2018-08-14T12:03:21Z IGNITE-9260 Fixed failing test (omit check in standalone WAL iterator) - Fixes #4533. Signed-off-by: Alexey Goncharuk (cherry picked from commit 237a99e) commit 80c46addc69d93fd082c37547c6368ee8616ad86 Author: vd-pyatkov Date: 2018-08-15T11:42:42Z IGNITE-8761 WAL fsync at rollover should be asynchronous in LOG_ONLY and BACKGROUND modes - Fixes #4356. Signed-off-by: Ivan Rakov (cherry picked from commit 3e75f9101411f0a6bf72aee1e52b2fc3507792ab) commit fa458b9d715e4f7e01ac2c0363ade40a0106cf29 Author: ascherbakoff Date: 2018-08-11T11:13:26Z IGNITE-9246 Optimistic transactions can wait for topology future on remap for a long time even if timeout is set. 
(cherry picked from commit 5d151063d554f23858373d642e7875b6a4f206f9) commit eadc99c7052a10168b4072c457be070d48e25dc8 Author: Aleksei Scherbakov Date: 2018-08-13T10:29:29Z IGNITE-9147 Race between tx async rollback and lock mapping on near node can produce hanging primary tx (cherry picked from commit a3f9076e475f603a6c3a457b64f721fd0c24a396) commit 3e18a523817f7175ec3131db26147e352d8a8324 Author: Sergey Kosarev Date: 2018-08-15T14:59:32Z fix imports (IGNITE-9244) commit b4e29ad9a9dcd3572366ed000395ec3ffc34a79f Author: Ivan Daschinskiy Date: 2018-08-16T11:21:19Z GG-14091 Add idle verify dump and idle verify v2 to security chain. commit 64b6504b4273e5a661c1420aea8ecb888f953c30 Author: Denis Mekhanikov Date: 2018-08-16T14:33:52Z IGNITE-9196: SQL: Fixed memory leak in mapper. This closes #4505. (cherry picked from commit bfa192ca473c992353c8bae8ba9aa5fa359378b3) commit b32f510e0d16adca2232f15ae70af1f15eb39940 Author: Pavel Kovalenko Date: 2018-08-16T15:39:49Z IGNITE-9227 Fixed missing reply to a single message during coordinator failover. Fixes #4518. (cherry picked from commit 66fcde3) commit bc1a1c685ec66be5f6360a36f7f842e79b040412 Author: Evgenii Zhuravlev Date: 2018-08-10T11:23:37Z IGNITE-5103 Fixed TcpDiscoverySpi not to ignore maxMissedHeartbeats property. Fixes #4446. (cherry picked from commit 1c840f59016273e0e99c95345c3afde639ef9689) commit d8af4076b65302ea31af461cda3fe747aea7c583 Author: Evgeny Stanilovskiy Date: 2018-08-15T17:28:48Z IGNITE-8493
[GitHub] ignite pull request #5578: IGNITE_DISABLE_WAL_DURING_REBALANCING turned on b...
GitHub user sergey-chugunov-1985 opened a pull request: https://github.com/apache/ignite/pull/5578 IGNITE_DISABLE_WAL_DURING_REBALANCING turned on by default, test for race between checkpointer and affinity change added You can merge this pull request into a Git repository by running: $ git pull https://github.com/gridgain/apache-ignite ignite-10505 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ignite/pull/5578.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5578 commit a111842d639a9b7f0e5a8d23fc6521b7dffa978e Author: Sergey Chugunov Date: 2018-12-05T12:37:20Z IGNITE-10505 test for race between checkpointer and affinity change commit 12f311d58ba4b7464dde8abc82d06a925b855195 Author: Sergey Chugunov Date: 2018-12-05T12:49:12Z IGNITE-10505 merge master ---
Re: [DISCUSSION] Design document. Rebalance caches by transferring partition files
Eduard,

This algorithm looks much easier, but could you clarify some edge cases, please?

If I understand correctly, when there is a continuous flow of updates to a page already transferred to the receiver, the checkpointer will write this page to the log file over and over again. Do you see any risk here of exhausting disk space on the sender's side?

What if some updates come after the checkpointer has stopped updating the log file? How will these updates be transferred to the receiver and applied there?

On Tue, Nov 27, 2018 at 7:52 PM Eduard Shangareev < eduard.shangar...@gmail.com> wrote:
> So, after some discussion, I could describe another approach to building a
> consistent partition on the fly.
>
> 1. We make a checkpoint, fix the size of the partition in OffheapManager.
> 2. After the checkpoint finishes, we start sending the partition file (without
> any lock) to the receiver, from 0 to the fixed size.
> 3. If subsequent checkpoints detect that they would override some pages of
> the file being transferred, they should write the previous state of the page
> to a dedicated file. So, we would have a list of pages written one by one;
> the page id is written in the page itself, so we can determine the page
> index. Let's name it the log.
> 4. When the transfer is finished, the checkpointer stops updating the log
> file. Now we are ready to send it to the receiver.
> 5. On the receiver side we start merging the dirty partition file with the
> log (updating it with pages from the log file).
>
> So, the advantages of this method:
> - checkpoint-thread work can't increase by more than a factor of two;
> - checkpoint threads shouldn't wait for anything;
> - in the best case, we receive the partition without any extra effort.
>
> On Mon, Nov 26, 2018 at 8:54 PM Eduard Shangareev <
> eduard.shangar...@gmail.com> wrote:
> > Maxim,
> >
> > I have looked through your algorithm for reading a partition consistently,
> > and I have some questions/comments.
> >
> > 1. The algorithm requires heavy synchronization between the checkpoint
> > thread and the new-approach rebalance threads, because you need strong
> > guarantees not to start writing to or reading from a chunk which the
> > counterpart has updated or started reading.
> > 2. Also, once we have started transferring a chunk, the original partition
> > couldn't be updated by checkpoint threads; they should wait for the
> > transfer to finish.
> > 3. If sending is slow and the partition is updated, then in the worst case
> > checkpoint threads would create a whole copy of the partition.
> >
> > So, what we have:
> > - on every page write the checkpoint thread should synchronize with the
> > new-approach rebalance threads;
> > - the checkpoint thread should do extra work, which could sometimes be as
> > huge as copying the whole partition.
> >
> > On Fri, Nov 23, 2018 at 2:55 PM Ilya Kasnacheev < ilya.kasnach...@gmail.com> wrote:
> >> Hello!
> >>
> >> This proposal will also happily break my compression-with-dictionary patch
> >> since it currently relies on only having local dictionaries.
> >>
> >> However, when you have compressed data, maybe the speed boost is even
> >> greater with your approach.
> >>
> >> Regards,
> >> -- Ilya Kasnacheev
> >>
> >> Fri, Nov 23, 2018 at 13:08, Maxim Muzafarov:
> >> > Igniters,
> >> >
> >> > I'd like to take the next step in increasing the rebalance speed of
> >> > Apache Ignite with persistence enabled. Currently, the rebalancing
> >> > procedure doesn't utilize the network and storage device throughput to
> >> > its full extent, even with reasonably high values of the
> >> > rebalanceThreadPoolSize property.
> >> > As part of the previous discussion `How to make rebalance faster` [1]
> >> > and IEP-16 [2], Ilya proposed an idea [3] of transferring cache
> >> > partition files over the network.
> >> > From my point of view, the case where this type of rebalancing procedure
> >> > can bring the most benefit is adding a completely new node or a set of
> >> > new nodes to the cluster. Such a scenario implies full relocation of
> >> > cache partition files to the new node. To roughly estimate the
> >> > superiority of partition file transmission over the network, the native
> >> > Linux scp/rsync commands can be used. My test environment showed the
> >> > result of the new approach as 270 MB/s vs the current 40 MB/s
> >> > single-threaded rebalance speed.
> >> >
> >> > I've prepared the design document IEP-28 [4] and accumulated all the
> >> > process details of the new rebalance approach on that page. Below you
> >> > can find the most significant details of the new rebalance procedure
> >> > and the components of Apache Ignite which are proposed to be changed.
> >> >
> >> > Any feedback is very much appreciated.
> >> >
> >> > *PROCESS OVERVIEW*
> >> >
> >> > The whole process is described in terms of rebalancing a single cache
> >> > group; partition files would be rebalanced one by one:
> >> >
> >> > 1. The demander node sends the GridDhtPartitionDemandMessage to the
> >> > supplier node;
> >> > 2. When the supplier node receives
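To make step 5 of Eduard's proposal concrete, here is a minimal Java sketch of the receiver-side merge, assuming a 4 KB page size and that the page index can be recovered from an id stored in the page header (both are illustrative assumptions; real code would rely on Ignite's page memory helpers):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of step 5: apply pages from the checkpoint log on top of the
    "dirty" partition file received during the transfer. */
class PartitionLogMerger {
    static final int PAGE_SIZE = 4096; // assumption: the configured page size

    static void merge(Path partition, Path log) throws IOException {
        try (FileChannel part = FileChannel.open(partition,
                 StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileChannel logCh = FileChannel.open(log, StandardOpenOption.READ)) {

            ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);

            // The log is a plain sequence of pages; each page carries its own id,
            // from which its index (offset) in the partition can be recovered.
            while (logCh.read(page) == PAGE_SIZE) {
                page.flip();

                long pageId = page.getLong(0);       // assumption: id in the first 8 bytes
                long pageIdx = pageId & 0xFFFFFFFFL; // assumption: low 32 bits = page index

                // Overwrite the possibly half-updated page captured during the
                // transfer with its consistent state preserved in the log.
                part.write(page, pageIdx * PAGE_SIZE);

                page.clear();
            }
        }
    }
}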
Re: [VOTE] Creation dedicated list for github notifications
+1

Plus, this dedicated list should be properly documented in the wiki; mentioning it in How to Contribute [1] or in Make Teamcity Green Again [2] would be a good idea.

[1] https://cwiki.apache.org/confluence/display/IGNITE/How+to+Contribute
[2] https://cwiki.apache.org/confluence/display/IGNITE/Make+Teamcity+Green+Again

On Tue, Nov 27, 2018 at 9:51 AM Павлухин Иван wrote:
> +1
> Tue, Nov 27, 2018 at 09:22, Dmitrii Ryabov:
> >
> > 0
> > Tue, Nov 27, 2018 at 02:33, Alexey Kuznetsov:
> > >
> > > +1
> > > Do not forget notifications from GitBox too!
> > >
> > > On Tue, Nov 27, 2018 at 2:20 AM Zhenya wrote:
> > >
> > > > +1, already did this with filters.
> > > >
> > > > > This was discussed already [1].
> > > > >
> > > > > So, I want to complete this discussion by moving GitHub notifications
> > > > > out of the dev list to a dedicated list.
> > > > >
> > > > > Please start voting.
> > > > >
> > > > > +1 - to accept this change.
> > > > > 0 - you don't care.
> > > > > -1 - to decline this change.
> > > > >
> > > > > This vote will go for 72 hours.
> > > > >
> > > > > [1] http://apache-ignite-developers.2346864.n4.nabble.com/Time-to-remove-automated-messages-from-the-devlist-td37484i20.html
>
> --
> Best regards,
> Ivan Pavlukhin
[jira] [Created] (IGNITE-10409) ExchangeFuture should be in charge of cancelling rebalancing process
Sergey Chugunov created IGNITE-10409:
-------------------------------------
Summary: ExchangeFuture should be in charge of cancelling rebalancing process
Key: IGNITE-10409
URL: https://issues.apache.org/jira/browse/IGNITE-10409
Project: Ignite
Issue Type: Improvement
Reporter: Sergey Chugunov
Fix For: 2.8

Ticket IGNITE-7165 introduced an improvement: an on-going partition rebalancing process is no longer cancelled when a client node joins the topology. A client join event doesn't change the affinity distribution, so the on-going rebalance remains valid; there is no need to cancel it and restart it again.

The implementation was based on introducing a new method *rebalanceRequired* in the *GridCachePreloader* interface. At the same time, PME optimization efforts enhanced the ExchangeFuture functionality, so now the future itself contains all the information about whether affinity changed or not.

We need to rework the code changes from IGNITE-7165 and base them on the ExchangeFuture functionality instead of a new method in the Preloader interface.
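A minimal sketch of the proposed rework, with hypothetical names standing in for Ignite's real exchange future API:

/** Hypothetical stand-in for the exchange future; not Ignite's actual API. */
interface ExchangeFuture {
    /** Whether this exchange changed the affinity distribution. */
    boolean changedAffinity();
}

class RebalancePolicy {
    /** True if the on-going rebalance must be cancelled and restarted. */
    static boolean rebalanceRestartRequired(ExchangeFuture fut) {
        // A client join doesn't change affinity, so in that case the
        // on-going rebalance stays valid and must not be cancelled.
        return fut.changedAffinity();
    }
}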
[GitHub] ignite pull request #5468: IGNITE-10374 if rebalance isn't restarted no need...
GitHub user sergey-chugunov-1985 opened a pull request: https://github.com/apache/ignite/pull/5468 IGNITE-10374 if rebalance isn't restarted no need to disable already … …disabled WAL You can merge this pull request into a Git repository by running: $ git pull https://github.com/gridgain/apache-ignite ignite-10374 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ignite/pull/5468.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5468 commit 22d0ac9f0c9ae16bb9c1ce24c8c6f05b586e1073 Author: Sergey Chugunov Date: 2018-11-22T07:56:44Z IGNITE-10374 if rebalance isn't restarted no need to disable already disabled WAL ---
[jira] [Created] (IGNITE-10374) Node doesn't own rebalanced partitions on rebalancing finished
Sergey Chugunov created IGNITE-10374:
-------------------------------------
Summary: Node doesn't own rebalanced partitions on rebalancing finished
Key: IGNITE-10374
URL: https://issues.apache.org/jira/browse/IGNITE-10374
Project: Ignite
Issue Type: Bug
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov
Fix For: 2.8

Prerequisite: the flag *IGNITE_DISABLE_WAL_DURING_REBALANCING* is set to true (the default value is false).

Scenario:
* A node joins the grid and starts rebalancing all cache groups from scratch (e.g. all db files of the node were cleaned up during its downtime);
* One or more client nodes join the topology while rebalancing is in progress.

Expected outcome: rebalancing finishes, the node owns all received partitions, the new affinity is applied.

Actual outcome: rebalancing finishes, but the node doesn't own any of the received partitions, and no affinity changes take place.
Re: Brainstorm: Make TC Run All faster
Dmitriy,

You brought up a really important topic that has a great impact on our project. Faster RunAlls mean quicker feedback and faster progress on issues and features.

We have a pretty big base of tests, about 50 thousand tests. Do we have an idea of how these tests overlap with each other? To my mind it is possible that we have a good bunch of tests that cover the same code and could be replaced with just a single test. In the ideal world we would even determine the minimal set of tests that covers our codebase and remove the excess ones (see the sketch after this message).

--
Best regards,
Sergey Chugunov.

On Thu, Nov 15, 2018 at 2:34 PM Dmitriy Pavlov wrote:
> Hi Igniters,
>
> Some of us have started to use the Bot to get approval of PRs. It helps to
> protect master from new failures, but it requires running the RunAll test
> set for each commit, and this puts noticeable pressure on the TC infra.
>
> I would like to ask you to share your ideas on how to make RunAll faster;
> maybe you can share any of your measurements and any other info about
> (possible) bottlenecks.
>
> Sincerely,
> Dmitriy Pavlov
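Determining such a minimal set is essentially the classic set-cover problem. A greedy Java sketch, assuming per-test coverage data (test name to covered code units) is available from some coverage tool:

import java.util.*;

/** Greedy set-cover sketch: repeatedly pick the test covering the most
    still-uncovered code units. The coverage map is an assumed input. */
class MinimalTestSet {
    static List<String> compute(Map<String, Set<String>> coverage) {
        Set<String> uncovered = new HashSet<>();
        coverage.values().forEach(uncovered::addAll);

        List<String> picked = new ArrayList<>();

        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = -1;

            // Pick the test with the largest number of still-uncovered units.
            for (Map.Entry<String, Set<String>> e : coverage.entrySet()) {
                Set<String> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);

                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = e.getKey();
                }
            }

            picked.add(best);
            uncovered.removeAll(coverage.get(best));
        }

        return picked;
    }
}

Greedy set cover is only an approximation, but it would give a first estimate of how much of the 50-thousand-test base is redundant.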
[jira] [Created] (IGNITE-10153) [TC Bot] Implement tests running time report
Sergey Chugunov created IGNITE-10153:
-------------------------------------
Summary: [TC Bot] Implement tests running time report
Key: IGNITE-10153
URL: https://issues.apache.org/jira/browse/IGNITE-10153
Project: Ignite
Issue Type: Task
Reporter: Sergey Chugunov
Assignee: Sergey Chugunov

In order to optimize the running time of the existing test base (at the moment all tests require ~50 hours of running time on the available agents) we need a report page with info about each suite.

At the first stage the page will show, for each suite, the tests running longer than 1 minute in the latest run in the master branch.

Later other features may be added, like analyzing PRs or other branches, or adjusting the running-time limit (e.g. all tests longer than 30 seconds, and so on).
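The first-stage report is essentially a filter-and-group over test durations. A Java sketch, where the TestRun shape is an assumption standing in for the TC Bot's real data model:

import java.util.*;
import java.util.stream.Collectors;

/** Sketch: for the latest run, keep tests slower than the limit, grouped by suite. */
class LongRunningTestsReport {
    static class TestRun {
        final String suite, name;
        final long durationMs;

        TestRun(String suite, String name, long durationMs) {
            this.suite = suite; this.name = name; this.durationMs = durationMs;
        }
    }

    /** e.g. limitMs = 60_000 for the "longer than 1 minute" first stage. */
    static Map<String, List<TestRun>> bySuite(List<TestRun> latestRun, long limitMs) {
        return latestRun.stream()
            .filter(t -> t.durationMs > limitMs)
            .sorted(Comparator.comparingLong((TestRun t) -> t.durationMs).reversed())
            .collect(Collectors.groupingBy(t -> t.suite));
    }
}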
[GitHub] ignite pull request #5047: IGNITE-9957 Updates count was reduced to speed up...
GitHub user sergey-chugunov-1985 opened a pull request: https://github.com/apache/ignite/pull/5047 IGNITE-9957 Updates count was reduced to speed up the test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gridgain/apache-ignite ignite-9957 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/ignite/pull/5047.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5047 commit 9fb69711329e6651c32c04c1df17f9709ddaec66 Author: Sergey Chugunov Date: 2018-10-22T11:25:28Z IGNITE-9957 Updates count was reduced to speed up the test ---
[jira] [Created] (IGNITE-9958) Optimize execution time of CacheContinuousQueryVariationsTest
Sergey Chugunov created IGNITE-9958:
-----------------------------------
Summary: Optimize execution time of CacheContinuousQueryVariationsTest
Key: IGNITE-9958
URL: https://issues.apache.org/jira/browse/IGNITE-9958
Project: Ignite
Issue Type: Improvement
Reporter: Sergey Chugunov
Fix For: 2.8

Tests from CacheContinuousQueryVariationsTest require a lot of time ([sample run|https://ci.ignite.apache.org/viewLog.html?buildId=2136245=IgniteTests24Java8_RunAll=testsInfo]) and thus slow down the build on TeamCity significantly. They need to be investigated and optimized if possible.
[jira] [Created] (IGNITE-9957) Optimize execution time of BinaryMetadataUpdatesFlowTest
Sergey Chugunov created IGNITE-9957:
-----------------------------------
Summary: Optimize execution time of BinaryMetadataUpdatesFlowTest
Key: IGNITE-9957
URL: https://issues.apache.org/jira/browse/IGNITE-9957
Project: Ignite
Issue Type: Improvement
Reporter: Sergey Chugunov
Fix For: 2.8

As TC statistics show ([example run on master branch|https://ci.ignite.apache.org/viewLog.html?currentGroup=test=org.apache.ignite.testsuites.IgniteBinaryObjectsTestSuite%23teamcity%23org.apache.ignite.internal.processors.cache.binary%23teamcity%23BinaryMetadataUpdatesFlowTest=1=DURATION_DESC=20===IgniteTests24Java8_RunAll=2136245=testsInfo]), three tests within this class require about 6 minutes of running time, which is a lot. It is worth investigating these tests and speeding them up.