Re: About Geode rolling downgrade
That's right, the no-down-time requirement is almost always managed by having replicated cluster setups (disaster-recovery/backup site). The data is either pushed to both systems through the data ingesters or by using a WAN setup. The clusters are upgraded one at a time. If there is a failure during the upgrade, or it needs to be rolled back, one system will always be up and running.

-Anil.

On Wed, Apr 22, 2020 at 1:51 PM Anthony Baker wrote:

> Anil, let me see if I understand your perspective by stating it this way:
>
> In cases where 100% uptime is a requirement, users are almost always running a disaster recovery site. It could be active/active or active/standby, but there are already at least 2 clusters with current copies of the data. If an upgrade goes badly, the clusters can be downgraded one at a time without loss of availability. This is because we ensure compatibility across the WAN protocol.
>
> Is that correct?
>
> Anthony
>
> > On Apr 22, 2020, at 10:43 AM, Anilkumar Gingade wrote:
> >
> > >> Rolling downgrade is a pretty important requirement for our customers
> > >> I'd love to hear what others think about whether this feature is worth the overhead of making sure downgrades can always work.
> >
> > I/We haven't seen users/customers requesting rolling downgrade as a critical requirement; most of the time they had both an old and a new setup to upgrade or to switch back to an older setup. Considering the amount of work involved and the code complexity it brings in, while there are ways to downgrade, it is hard to justify supporting this feature.
> >
> > -Anil.
Re: About Geode rolling downgrade
Anil, let me see if I understand your perspective by stating it this way:

In cases where 100% uptime is a requirement, users are almost always running a disaster recovery site. It could be active/active or active/standby, but there are already at least 2 clusters with current copies of the data. If an upgrade goes badly, the clusters can be downgraded one at a time without loss of availability. This is because we ensure compatibility across the WAN protocol.

Is that correct?

Anthony

> On Apr 22, 2020, at 10:43 AM, Anilkumar Gingade wrote:
>
> >> Rolling downgrade is a pretty important requirement for our customers
> >> I'd love to hear what others think about whether this feature is worth the overhead of making sure downgrades can always work.
>
> I/We haven't seen users/customers requesting rolling downgrade as a critical requirement; most of the time they had both an old and a new setup to upgrade or to switch back to an older setup. Considering the amount of work involved and the code complexity it brings in, while there are ways to downgrade, it is hard to justify supporting this feature.
>
> -Anil.
[DISCUSS] preparing for Geode 1.13.0 release
Geode is scheduled to cut support/1.13 on May 4, per the quarterly release schedule approved [1] in 2018 and affirmed in last month’s “Shipping patch releases” RFC [2]. Please volunteer if you are interested in serving as Release Manager for Geode 1.13.0.

[1] https://lists.apache.org/thread.html/8626f7cc73b49cc90129ec5f6021adab3815469048787032935bfc1e%40%3Cdev.geode.apache.org%3E
[2] https://cwiki.apache.org/confluence/display/GEODE/Shipping+patch+releases
Re: About Geode rolling downgrade
>> Rolling downgrade is a pretty important requirement for our customers
>> I'd love to hear what others think about whether this feature is worth the overhead of making sure downgrades can always work.

I/We haven't seen users/customers requesting rolling downgrade as a critical requirement; most of the time they had both an old and a new setup to upgrade or to switch back to an older setup. Considering the amount of work involved and the code complexity it brings in, while there are ways to downgrade, it is hard to justify supporting this feature.

-Anil.

On Tue, Apr 21, 2020 at 2:01 PM Dan Smith wrote:

> > Anyhow, we wonder what would be as of today the recommended or official way to downgrade a Geode system without downtime and data loss?
>
> I think the without-downtime option is difficult right now. The most bulletproof way to downgrade without data loss is probably just to export/import the data, but that involves downtime. In many cases you could restart the system with an old version if you have persistent data, because the on-disk format doesn't change that often, but that won't work in all cases. Or, if you have multiple redundant WAN sites, you could potentially shift traffic from one to the other and recreate a WAN site, but that also requires some work.
>
> > Rolling downgrade is a pretty important requirement for our customers, so we would not like to close the discussion here and instead try to see if it is still reasonable to propose it for Geode, maybe relaxing the expectations a bit and clarifying some things.
>
> I agree that rolling downgrade is a useful feature for some cases. I also agree we would need to add a lot of tests to make sure we really can support it. I'd love to hear what others think about whether this feature is worth the overhead of making sure downgrades can always work. As Bruce pointed out, we have made changes in the past, and we will make changes in the future, that may need additional logic to support downgrades.
>
> Regarding your downgrade steps, they look reasonable. You might consider downgrading the servers first. Rolling *upgrade* upgrades the locators first, so up to this point we have only tested a newer locator with an older server.
>
> -Dan
>
> On Mon, Apr 20, 2020 at 9:13 AM wrote:
>
> > Hi,
> >
> > I agree that if we wanted to support limited rolling downgrade, some other version interchange needs to be done and extra tests will be required.
> >
> > Nevertheless, this could be done using gfsh or with a startup parameter. For example, in the case you mentioned about the UDP messaging, some command like "enable UDP messaging" could put the system back into a state equivalent to "upgrade in progress but not yet completed", which would allow old members to join again. I guess for each case there would be particularities, but they should not involve a lot of effort, because most of the mechanisms needed (the ones that allow old and new members to coexist) will have been developed for the rolling upgrade.
> >
> > Anyhow, we wonder what would be as of today the recommended or official way to downgrade a Geode system without downtime and data loss?
> >
> > From: Bruce Schuchardt
> > Sent: Friday, April 17, 2020 11:36 PM
> > To: dev@geode.apache.org
> > Subject: Re: About Geode rolling downgrade
> >
> > Hi Alberto,
> >
> > I think that if we want to support limited rolling downgrade, some other version interchange needs to be done, and there need to be tests that prove that the downgrade works. That would let us document which versions are compatible for a downgrade and enforce that no one attempts it between incompatible versions.
> >
> > For instance, there is work going on right now that introduces communications changes to remove UDP messaging. Once rolling upgrade completes, it will shut down unsecure UDP communications. At that point there is no way to go back. If you tried it, the old servers would try to communicate with UDP, but the new servers would not have UDP sockets open, for security reasons.
> >
> > As a side note, clients would all have to be rolled back before starting in on the servers. Clients aren't equipped to talk to an older-version server, and servers will reject the client's attempts to create connections.
> >
> > On 4/17/20, 10:14 AM, "Alberto Gomez" wrote:
> >
> > Hi Bruce,
> >
> > Thanks a lot for your answer. We had not thought about the changes in distributed algorithms when analyzing rolling downgrades.
> >
> > Rolling downgrade is a pretty important requirement for our customers, so we would not like to close the discussion here and instead try to see if it is still reasonable to propose it for Geode, maybe relaxing the expectations a bit and clarifying some things.
> >
> > First, I think supporting rolling downgrade does not
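[For reference, the export/import route Dan mentions can be done with gfsh's data commands. A sketch only — region, member, and file names below are hypothetical, and downtime is still required between the export and the import; check the gfsh reference for options such as --parallel on partitioned regions:]

```
gfsh> connect --locator=localhost[10334]
gfsh> export data --region=/example --file=example.gfd --member=server1
(stop the cluster, restart it on the older Geode version)
gfsh> connect --locator=localhost[10334]
gfsh> import data --region=/example --file=example.gfd --member=server1
```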
Re: Data ingestion with predefined buckets
Steve,

Have you looked at grouping your putAll() requests into groups that align to Geode’s buckets? In your application code, you can determine the hash for each data item and self-partition the entries. This allows you to send the requests on separate threads in parallel while optimizing network traffic. I have seen this used for very high-throughput ingest use cases.

Anthony

> On Apr 16, 2020, at 11:09 AM, Anilkumar Gingade wrote:
>
> >> PutAllPRMessage.*
>
> These are internal APIs/message protocols used to handle PartitionedRegion messages. The messages are sent from the originator node to peer nodes to operate on a given partitioned region; they are not intended as application APIs.
>
> We could consider looking at the code which determines the bucket id for each of the putAll keys. If there is routing info that identifies a common data store (bucket), the code could be optimized there...
>
> My recommendation is still using the existing APIs and trying to tune the putAll map size. By reducing the map size, you will be pushing small chunks of data to the server while the remaining data is acted upon (at the client), which keeps both client and server busy at the same time. You can also look at tuning the socket buffer size to fit your data size, so that the data is written/read in a single chunk.
>
> -Anil
>
> On Wed, Apr 15, 2020 at 7:01 PM steve mathew wrote:
>
> > Anil, yes it's a kind of custom hash (which involves calculating a hash on all fields of the row). I have to stick to the predefined mechanism based on which the source files are generated.
> >
> > It would be a great help if someone could point me to any available *server-side internal API that provides bucket-level data ingestion, if any*. While exploring, I came across "*PartitionRegion.sendMsgByBucket(bucketId, PutAllPRMessage)*". Can this API internally take care of redundancy (ingestion into secondary buckets on peer nodes)?
> >
> > Can someone explain *PutAllPRMessage.operateOnPartitionedRegion(ClusterDistributionManager dm, PartitionedRegion pr, ...)*? It seems this handles the putAll message from a peer. When is this required?
> >
> > Thanks
> >
> > Steve M.
> >
> > On Wed, Apr 15, 2020 at 11:06 PM Anilkumar Gingade wrote:
> >
> > > About the API: I would not recommend using bucketId in the API, as it is internal, and there are other internal/external APIs that rely on bucket id calculations, which could be compromised here.
> > >
> > > Instead of adding new APIs, probably looking at minimizing/reducing the time spent may be a good start.
> > >
> > > BucketRegion.waitUntilLocked - A putAll thread could spend time here when there are multiple threads acting upon the same bucket; one way to reduce this is by tuning the putAll size. Can you try changing your putAll size (say, start with 100)?
> > >
> > > I am wondering about the time spent in hashCode(); is it custom code?
> > >
> > > If you want to create the buckets upfront, you can try calling the method PartitionRegionHelper.assignBucketsToPartitions().
> > >
> > > -Anil
> > >
> > > On Wed, Apr 15, 2020 at 8:37 AM steve mathew wrote:
> > >
> > > > Thanks Dan, Anil and Udo for your inputs. Extremely sorry for the late reply, as I took a bit of time to explore and understand the Geode internals.
> > > >
> > > > It seems the BucketRegion/Bucket terminology is not exposed to the user, but still I am trying to achieve something that is uncommon and for which a client API is not exposed.
> > > >
> > > > *Details about the use case/client*
> > > > - Multi-threaded client - each task performs data ingestion on a specific bucket. Each task knows the bucket number to ingest data into; in short, the client knows the task-->bucket mapping.
> > > > - Each task iteratively ingests data in batches (configurable) of 1000 records into the bucket assigned to it.
> > > > - Parallelism is achieved by running multiple tasks concurrently.
> > > >
> > > > *When I tried with the existing R.putAll() API, I observed slow performance, and the related observations are:*
> > > > - A few tasks take quite a long time (a thread dump shows the thread WAITING on BucketRegion.waitUntilLocked), hence the overall client takes a longer time.
> > > > - Code profiling shows a good amount of time spent in hash-code calculation. It seems key.hashCode() gets calculated on both client and server, which is not required for my use case as the task-->bucket mapping is known beforehand.
> > > > - The putAll() client implementation takes care of parallelism (using a PRMetadata-enabled thread pool and reshuffling the keys internally), but in my case that's taken care of by multiple tasks, each per bucket, within my client.
> > > >
> > > > *I have forked the Geode codebase and am trying to extend it by providing a client API like:*
> > > >
> > > > //Region.java
> > > > /**
> > > >  * putAll records in specified bucket
> > > >  */
> > > > *public void putAll(int bucketId, map)*
> > > >
> > > > I have already added the client-side message and related code
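[Anthony's suggestion at the top of this thread — self-partitioning putAll() batches so each thread's batch aligns to one bucket — can be sketched in plain Java. This is an illustration only: `bucketIdFor` mimics the default routing idea (|hashCode| mod total-num-buckets, 113 being Geode's default) and is not a Geode API; the real routing lives in Geode internals, and a custom hash like Steve's would replace `key.hashCode()` here.]

```java
import java.util.HashMap;
import java.util.Map;

public class BucketGrouping {

    // Hypothetical stand-in for Geode's internal key-to-bucket routing:
    // bucketId = |key.hashCode()| % totalNumBuckets (default totalNumBuckets is 113).
    static int bucketIdFor(Object key, int totalNumBuckets) {
        return Math.abs(key.hashCode() % totalNumBuckets);
    }

    // Split one large map into per-bucket maps so each ingest task (thread)
    // can call region.putAll() with entries that all land in the same bucket,
    // avoiding cross-bucket lock contention (BucketRegion.waitUntilLocked).
    static <K, V> Map<Integer, Map<K, V>> groupByBucket(Map<K, V> entries, int totalNumBuckets) {
        Map<Integer, Map<K, V>> groups = new HashMap<>();
        for (Map.Entry<K, V> e : entries.entrySet()) {
            int bucket = bucketIdFor(e.getKey(), totalNumBuckets);
            groups.computeIfAbsent(bucket, b -> new HashMap<>()).put(e.getKey(), e.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> data = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            data.put("key-" + i, "value-" + i);
        }
        // Each sub-map would be handed to its own thread calling region.putAll(subMap).
        Map<Integer, Map<String, String>> groups = groupByBucket(data, 113);
        int total = groups.values().stream().mapToInt(Map::size).sum();
        System.out.println(groups.size() + " buckets, " + total + " entries");
    }
}
```

[With the entries pre-grouped like this, the parallel ingest threads do not contend on the same bucket lock, which is the waitUntilLocked symptom observed above.]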
Re: Reconfiguring our notifications and more
Sorry for the delay, my day went very sideways yesterday. I’ve pushed changes to master (and develop where needed) for the following repos:

geode
geode-benchmarks
geode-examples
geode-kafka-connector
geode-native
geode-site

I’m not entirely satisfied it’s doing everything we want (like linking JIRAs to PRs), but at least it directs the notifications to the right ML.

Anthony

> On Apr 21, 2020, at 8:54 AM, Anthony Baker wrote:
>
> I’d like a quick round of feedback so we can stop the dev@ spamming [1].
>
> ASF has implemented a cool way to give projects control of lots of things [2]. In short, you provide a .asf.yml in each and every GitHub repo to control lots of interesting things. For us the most immediately relevant are:
>
> notifications:
>   commits: comm...@geode.apache.org
>   issues: iss...@geode.apache.org
>   pullrequests: notificati...@geode.apache.org
>   jira_options: link label comment
>
> I’d like to commit this to /develop and cherry-pick over to /master ASAP. Later on we can explore the website and GitHub sections. Let me know what you think.
>
> Anthony
>
> [1] https://issues.apache.org/jira/browse/INFRA-20156
> [2] https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-Notificationsettingsforrepositories
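[For reference, the settings quoted above live in a .asf.yaml file at the root of each repository. A sketch based on the quoted message — the mail addresses appear truncated here exactly as in the archive, and the authoritative schema is the ASF Infra documentation linked in the message:]

```yaml
# .asf.yaml (per repository)
notifications:
  commits: comm...@geode.apache.org
  issues: iss...@geode.apache.org
  pullrequests: notificati...@geode.apache.org
  jira_options: link label comment
```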
Fixed: apache/geode-examples#439 (develop - cc644fa)
Build Update for apache/geode-examples

Build: #439
Status: Fixed
Duration: 22 mins and 7 secs
Commit: cc644fa (develop)
Author: Anthony Baker
Message: Updated notifications configuration file

View the changeset: https://github.com/apache/geode-examples/compare/28f580168a00...cc644faadd61

View the full build log and details: https://travis-ci.org/github/apache/geode-examples/builds/678218603?utm_medium=notification_source=email
[GitHub] [geode] alb3rtobr commented on issue #4978: Fix for regression introduced by GEODE-7565
alb3rtobr commented on issue #4978:
URL: https://github.com/apache/geode/pull/4978#issuecomment-617861442

> One additional comment regarding the following warning message:
>
> ```
> [warn 2020/04/18 23:44:22.757 PDT tid=0x298] Unable to ping non-member rs-FullRegression19040559a2i32xlarge-hydra-client-63(bridgegemfire1_host1_4749:4749):41003 for client identity(rs-FullRegression19040559a2i32xlarge-hydra-client-63(edgegemfire3_host1_1071:1071:loner):50046:5a182991:edgegemfire3_host1_1071,connection=2
> ```
>
> The above is logged by member `bridgegemfire1_host1_4749` in the tests, exactly the member to which the `ping` command was sent... this warning is logged within the `pingCorrectServer` method which, unless I'm missing something, shouldn't be invoked at all if this member is the actual recipient for the `ping` message. Maybe `!myID.equals(targetServer)` is not working as expected?

I added a log right after the comparison to check which values were considered different. I created two clusters with two servers, and in my case I don't see the comparison failing. This is an example:

```
[info 2020/04/22 15:05:01.338 GMT tid=0x56] ALBERTO - MyID=172.17.0.3(server-0:85):41000 - target=172.17.0.9(server-1:84):41000
```

Could you add the same log to your code?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [geode] prettyClouds opened a new pull request #4984: Refactor: Split SetsIntegrationTest into multiple files
prettyClouds opened a new pull request #4984:
URL: https://github.com/apache/geode/pull/4984

Thank you for submitting a contribution to Apache Geode. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
- [ ] Is your initial contribution a single, squashed commit?
- [ ] Does `gradlew build` run cleanly?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?

### Note:
Please ensure that once the PR is submitted, you check Concourse for build issues and submit an update to your PR as soon as possible. If you need help, please send an email to dev@geode.apache.org.
[GitHub] [geode] jujoramos commented on issue #4978: Fix for regression introduced by GEODE-7565
jujoramos commented on issue #4978:
URL: https://github.com/apache/geode/pull/4978#issuecomment-617814149

One additional comment regarding the following warning message:

```
[warn 2020/04/18 23:44:22.757 PDT tid=0x298] Unable to ping non-member rs-FullRegression19040559a2i32xlarge-hydra-client-63(bridgegemfire1_host1_4749:4749):41003 for client identity(rs-FullRegression19040559a2i32xlarge-hydra-client-63(edgegemfire3_host1_1071:1071:loner):50046:5a182991:edgegemfire3_host1_1071,connection=2
```

The above is logged by member `bridgegemfire1_host1_4749` in the tests, exactly the member to which the `ping` command was sent... this warning is logged within the `pingCorrectServer` method which, unless I'm missing something, shouldn't be invoked at all if this member is the actual recipient for the `ping` message. Maybe `!myID.equals(targetServer)` is not working as expected?
[GitHub] [geode] jdeppe-pivotal commented on issue #4861: GEODE-7905: RedisDistDUnitTest failing intermittently in CI
jdeppe-pivotal commented on issue #4861:
URL: https://github.com/apache/geode/pull/4861#issuecomment-617771032

Since we're actively working on various approaches to this, and it's probably still going to take a while, I'm going to close this PR for now.
[GitHub] [geode] jujoramos commented on a change in pull request #4978: Fix for regression introduced by GEODE-7565
jujoramos commented on a change in pull request #4978:
URL: https://github.com/apache/geode/pull/4978#discussion_r412828019

File path: geode-core/src/main/java/org/apache/geode/cache/client/internal/PingOp.java

```
@@ -43,12 +45,16 @@ private PingOp() {
   static class PingOpImpl extends AbstractOp {
     private long startTime;
+    private ServerLocation location;
```

Review comment: The above fields are not used anywhere; can we just remove them, or am I missing something?

File path: geode-core/src/main/java/org/apache/geode/cache/client/internal/PingOp.java

```
@@ -65,8 +71,9 @@ protected boolean needsUserId() {
   @Override
   protected void sendMessage(Connection cnx) throws Exception {
     getMessage().clearMessageHasSecurePartFlag();
-    this.startTime = System.currentTimeMillis();
-    getMessage().send(false);
+    getMessage().setNumberOfParts(1);
+    getMessage().addObjPart(serverID);
+    getMessage().send(true);
```

Review comment: These operations can be directly executed within the `PingOpImpl` constructor, as we do with the rest of the messages. I've also noted that you changed the `clearMessage` flag from `false` to `true`; any reason behind that?
[GitHub] [geode] jujoramos commented on a change in pull request #4970: GEODE-7676: Add PR clear with expiration tests
jujoramos commented on a change in pull request #4970:
URL: https://github.com/apache/geode/pull/4970#discussion_r412794182

File path: geode-core/src/distributedTest/java/org/apache/geode/internal/cache/PartitionedRegionClearWithExpirationDUnitTest.java

Review comment:

> The test needs 30 minutes to finish all combinations. And many tests took more than 30 seconds each.

This is not a bad thing per se; I wanted to test all possible combinations, so it's expected for the distributed test to take some time to finish. I'll remove some combinations, though, so the overall time will be reduced.

> Many combinations are unnecessary. For example, we want to see that the expiration tasks are cancelled; we don't care which expiration. Only when we want to see an expiration be triggered do we need some (not all) expiration types. In my opinion, we can hard-code this test to use the DESTROY expiration type only. We don't have to verify clear from an accessor or a server — some other tests have verified that; you can just use a server to do the clear. Region types can also be reduced to a few. We have other tests that verify that all the combinations of region types can do clear successfully. We only need to verify that the expiration tasks are cleared in this test.

Having tests that verify several possible combinations is better than having no tests at all, especially for the region types, as we might change the implementation class in the future... if we do, this test might be able to catch possible regressions introduced, so I'd prefer to keep them all. Regarding the `ExpirationAction`, I agree; I will remove the `INVALIDATE` one and just use `DESTROY`.
[GitHub] [geode] jujoramos commented on issue #4978: Fix for regression introduced by GEODE-7565
jujoramos commented on issue #4978:
URL: https://github.com/apache/geode/pull/4978#issuecomment-617642403

@bschuchardt @alb3rtobr: I'm still analysing the issue and seeing whether I can reproduce it consistently; so far it's proving quite elusive, as it only fails once or twice in around 100 runs of my test. Will keep you posted.