[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 I merged this into master and squashed the commits so I am closing the pull request, but will leave the JIRA open so we can look into pulling it back into 1.x as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @harshach it has been 6 hours so I am going to check this in because I don't want to have to try and keep it up to date as the code base changes. If you do find anything as time goes on I will be very happy to address your concerns in other pull requests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @harshach I have +1 from enough people to merge this in and would really like to move forward with it, but if you are still looking at it I want to be sure you get a chance to finish. Please let me know either way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user srdo commented on the issue: https://github.com/apache/storm/pull/1642 @revans2 Thanks, looks great. +1 for merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user ptgoetz commented on the issue: https://github.com/apache/storm/pull/1642 I have a few style-related nits, but I've refrained from pointing them out because the style(s) in the codebase are all over the place, and the style is inherited in some places. If we want to get nit picky over style, we should really have a style guide in place. That's a JIRA for another day... Overall I think it looks great and I am +1 for merging. Thanks @revans2 for the great work, and everyone else for the thorough review. In terms of back porting this to other branches, I'm okay with that. While the impetus behind this work was the Clojure to Java migration, I think the improvements made here warrant back-porting to earlier version branches. It also doesn't really affect user-facing APIs, so that's not a concern. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 Not sure why travis has so many issues with the build and maven but here is my travis setup on the same branch with the build passing. https://travis-ci.org/revans2/incubator-storm/builds/160266388 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @srdo Pushed a new version that hopefully addresses your review comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @srdo on the `newAssignment.compareAndSet` you are 100% correct. I originally had it in there to clear out the assignment when it is processed, but then realized that each time a new assignment is downloaded it would go back to what it was before and we needed to handle that. I'll clean it up. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user ptgoetz commented on the issue: https://github.com/apache/storm/pull/1642 @revans2 I'm also in the process of reviewing this. I hope to finish by the end of the day. So far everything looks good to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @harshach did you get a chance to take a look yet? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 The Travis failures are from hive dependencies not being downloaded. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 @revans2 Thanks for the great work. Really appreciated. +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @HeartSaVioR I found the original comment and put it the change. I also fixed some other review comments from @abellina and also fixed an annoying failure with drpc_auth_test.clj not being able to bind to a port. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 @harshach There's already an another pull request for fixing race condition on old supervisor. Given that we can't set a due date for 2.0 (and maintenance date for 1.x) it would be better to have this as kind of bugfix. @revans2 The code block in my comment fixes the issue. You seem to fix it, but it's gone now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @HeartSaVioR @harshach Once this goes in I would be happy to make a version for 1.x In fact we are in the process of pulling it back for the version we use internally, and I was able to keep the old supervisor in place as a backup option. As for the regression I will try and find the original code and see where I made a mistake. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user harshach commented on the issue: https://github.com/apache/storm/pull/1642 @HeartSaVioR do we want this for 1.x branch? Given that it's a java migration and pretty much a new feature that can get into 1.x-branch. I would rather keep 1.x-branch as it is. @revans2 will go through it today. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 And I saw some places using lambda which can't be applied to 1.x. @revans2 Could you craft a pull request for 1.x branch (might also 1.0.x) too when we're sure it's good to merge? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 I've left a comment of regression issue, but other manual tests passed including failed things. I'd be +1 once regression issue is resolved. Great works. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @knusbaum @srdo @harshach @HeartSaVioR @abellina I really would appreciate it if you could take another look at this patch. I think everything is ready to go except a final squash before merging it. The unit tests all pass, I have run some basic manual tests with CGroups and run as user enabled. I have run a number of manual tests on my mac, including rebalancing, killing the workers, killing the supervisor, and going back and forth between old and new supervisor. I have manually deleted and added files/directories in storm-dist on the supervisor. I also ran the full set of the internal yahoo integration tests against it and everything has passed. I really would like to start rolling this out internally at yahoo, but I want to get as many eyes looking at it to find issues first. Better here then in production. On the upside the bugs I have found have been rather easy to debug, and very reproducible because the race conditions have been eliminated when interacting with the Container. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 Not sure what has been happening with travis not being able to get to the apache maven repo all the time, but my build in travis passed https://travis-ci.org/revans2/incubator-storm/builds/157725399 despite the official build failing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 I could not reproduce the one failure and the rat failure is fixed by STORM-2054 https://github.com/apache/storm/pull/1648 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 Looks like there were two failures in travis. One is a rat issue with storm-submit-tools the other is an integration test that timed out. I'll try to reproduce the issues and see if I can fix them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 I recovered it and fixed some issues with integration tests/rat. I think it should be good to go. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @srdo Ya that is what I did. Thanks for the advice. Not sure what happened somehow when I upmerged all but 2 of my commits disappeared. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user srdo commented on the issue: https://github.com/apache/storm/pull/1642 Check the ref log, maybe you can recover https://git-scm.com/docs/git-reflog --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 Sorry, git just ate all of my more recent changes, and I have no idea what happened. I'll try to see if I can recover them, but I might have to start over... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 I just pushed in a much of new fixes and addressed all of the outstanding review comments. I think it is good to go. Please take another look/test it and let me know. I have run tests with rebalancing, rolling upgrades, Recovery, deleting directories from distcache, and added directories to distcache. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @HeartSaVioR I figured out what happened to make the supervisor crash. In the transition from RUNNING to KILL to speed things up slot starts localizing resources for the new assignment. In the transition from KILL to WAITING_FOR_BASIC_LOCALIZATION slot releases the resources for the container that was just killed. But the reference counting is only for the topology id and port. If the port and topology id are the same, then we get this. I need a key that is specific to the assignment/port, not the topology/port. I'll try to figure out what that should look like. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user d2r commented on the issue: https://github.com/apache/storm/pull/1642 I was able to review most of the code. I skimmed the tests and didn't see anything strange. @revans2 knows that I will not be able to log on again for awhile. I want to leave a note for others that, as my comments were either minor or were something we discussed, I don't want to hold up a merge due to my unresolved review comments should this change otherwise be ready to merge before I am back. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @HeartSaVioR Yes it looks like I need to think through recovery and any races with the AsyncLocalizer a bit more. I'll try to reproduce your error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 I found another issue: When I rebalance 3 workers into 1 worker, all workers are killed first (expected) and AsyncLocalizer clear out topology codes since all workers are killed. But Supervisor doesn't download topology code again while starting new worker, so it goes wrong and worker and/or supervisor are killed. This seems to be a kind of race condition (Slot and AsyncLocalizer) and I saw two scenarios: 1. Worker can be launched but topology directory is removed after launching so worker is crashed. Slot tries to relaunch worker but throws IllegalStateException because topology directory is gone and supervisor also be killed. 2. Supervisor is killed even before launching worker. After this, Supervisor will consistently be killed unless clearing out supervisor directory as same as above comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 With my patch (symlink issue and NPE issue) I can see workers launched and killed by Supervisor V2. (remote) Tested: - kill worker process with -9 - rebalance with different worker count - jstack dump, heap dump, restart worker --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 Found other issue: Constructor of BasicContainer (in fact constructor of Container) throws FNF when topology dist files are deleted. So even though Slot is creating with recover mode, Supervisor is killed instead of throwing ContainerRecoveryException. Supervisor will consistently be killed unless we remove the assignment for that - killing topology. This is a weird case but can be happening, and I guess that's why ContainerRecoveryException exists. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 I merged this in my local, and do some tests, and see: 1. integration test fails from my dev. machine - I'm using OSX 10.11, Java 1.8.0_66 - I got wrong IP address from integration test. I'm using IP sharing router but assigned IP is even not same as external IP. -- This is similar to what I consistently observed from my shared office. -- In my office even build for current master is failing. (I'm testing this pull request in home now.) -- Strange thing is that normal unit tests are not failing for this build. In my office it's also failing. 2. There's a bug for creating symbolic link for dependencies, and also createSymbolicLink. I'll leave a comment to diff. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 I think I have addressed most of the issues so far. I have been running some manual tests and have run a cluster with run as user and cgroup enforcement enabled. I plan on doing some more tests, but I think the current pull request should be ready for anyone who whats to take a look at it/test it to do so. It should be a drop in replacement for the current supervisor, and I have even tested it doing a rolling upgrade/downgrade. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @HeartSaVioR take your time. I want to be sure that I have plenty of eyeballs looking at this. We are doing this mostly because we started to run into a lot of race conditions in the supervisor. We would try to fix one, and another would be exposed. This makes it so that interacting with the container is only ever done on a single thread so we can reason about the state of the container and avoid those races. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user HeartSaVioR commented on the issue: https://github.com/apache/storm/pull/1642 I'm still reviewing this. Nice work to refactor to introduce greater readability. Will leave comment here once I'm done with first pass. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm issue #1642: STORM-2018: Supervisor V2.
Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1642 @d2r @abellina @harshach @knusbaum I believe that I am done and that the code is ready to merge. I have addressed all of the review comments to date. I have finished with updating the unit tests. I am going to start doing more extensive testing of CGROUP and run as user support, but because we are not too close to a 2.x release I think we can address any issues I find in follow up JIRA. Please take another look and let me know. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---