[GitHub] storm pull request: STORM-188: Allow user to specify full configu...
Github user clockfly closed the pull request at: https://github.com/apache/storm/pull/120
[GitHub] storm pull request: STORM-188: Allow user to specify full configu...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/120#issuecomment-146773478 Sure.
[GitHub] storm pull request: STORM-188. Allow user to specify full configu...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/storm/pull/495#discussion_r27561927

--- Diff: storm-core/src/jvm/backtype/storm/utils/Utils.java ---

```java
 public static Map findAndReadConfigFile(String name, boolean mustExist) {
+    InputStream in = null;
+    Boolean confFileEmpty = false;
     try {
-        HashSet<URL> resources = new HashSet<URL>(findResources(name));
-        if (resources.isEmpty()) {
-            if (mustExist) throw new RuntimeException("Could not find config file on classpath " + name);
-            else return new HashMap();
-        }
-        if (resources.size() > 1) {
-            throw new RuntimeException("Found multiple " + name + " resources. You're probably bundling the Storm jars with your topology jar. " + resources);
-        }
-        URL resource = resources.iterator().next();
-        Yaml yaml = new Yaml(new SafeConstructor());
-        Map ret = null;
-        InputStream input = resource.openStream();
-        try {
-            ret = (Map) yaml.load(new InputStreamReader(input));
-        } finally {
-            input.close();
+        in = getConfigFileInputStream(name);
+        if (null != in) {
+            Yaml yaml = new Yaml(new SafeConstructor());
+            Map ret = (Map) yaml.load(new InputStreamReader(in));
+            if (null != ret) {
+                return new HashMap(ret);
+            } else {
+                confFileEmpty = true;
+            }
         }
-        if (ret == null) ret = new HashMap();
-
-        return new HashMap(ret);
-
+        if (mustExist) {
+            if (confFileEmpty)
+                throw new RuntimeException("Config file " + name + " doesn't have any valid storm configs");
+            else
+                throw new RuntimeException("Could not find config file on classpath " + name);
+        } else {
+            return new HashMap();
+        }
     } catch (IOException e) {
         throw new RuntimeException(e);
+    } finally {
+        if (null != in) {
+            try {
+                in.close();
+            } catch (IOException e) {
+                throw new RuntimeException(e);
+            }
+        }
     }
 }

+private static InputStream getConfigFileInputStream(String configFilePath)
+        throws IOException {
+    if (null == configFilePath) {
+        throw new IOException("Could not find config file, name not specified");
+    }
+
+    HashSet<URL> resources = new HashSet<URL>(findResources(configFilePath));
+    if (resources.isEmpty()) {
+        File configFile = new File(configFilePath);
+        if (configFile.exists()) {
+            return new FileInputStream(configFile);
+        }
```

--- End diff --

@lazyval The mustExist flag controls whether the method should fail:

```java
public static Map findAndReadConfigFile(String name, boolean mustExist)
```

When we cannot find a resource, we return null; then we check whether the mustExist flag is enabled to determine whether we should throw a RuntimeException.
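A minimal usage sketch of that contract, assuming only the signature shown in the diff; `my-optional.yaml` is a placeholder file name, not something from the PR.

```java
import java.util.Map;

import backtype.storm.utils.Utils;

public class ConfigLookupExample {
    public static void main(String[] args) {
        // mustExist = false: a missing (or empty) file yields an empty map
        Map optional = Utils.findAndReadConfigFile("my-optional.yaml", false);
        System.out.println("optional config entries: " + optional.size());

        // mustExist = true: throws RuntimeException when the file is neither
        // on the classpath nor on disk, or when it contains no valid configs
        Map required = Utils.findAndReadConfigFile("storm.yaml", true);
        System.out.println("loaded " + required.size() + " required entries");
    }
}
```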
[GitHub] storm pull request: STORM-188. Allow user to specify full configu...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/495#issuecomment-88295467 +1
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-82196521 I will ask Ted to close this.
[GitHub] storm pull request: Client (Netty): improving logging to help trou...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/463#issuecomment-78255547 +1
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74084548 +1

On Thu, Feb 12, 2015 at 6:44 PM, Michael G. Noll notificati...@github.com wrote:

Thanks for your feedback, Nathan. As far as I understand, this patch does not enable backpressure. But because there is no backpressure (yet) that we can rely on, this patch at least improves the situation during the startup phase of a topology, ensuring that a) an unacked topo will not lose messages during startup, and b) we do not unnecessarily replay messages for acked topos during their startup. This is achieved by checking that all worker connections are ready before the topology starts processing data. So backpressure is still an open feature. Backpressure was IIRC mentioned in the initial PR because there was a deficiency (dating back to a ZMQ-related TODO) that caused problems related to this PR/Storm tickets (327, 404, and one more). However, this patch does make the best of the current situation even in the absence of backpressure. But first and foremost this patch fixes a (critical) cascading failure that can bring Storm clusters to a halt. Please correct me if I'm mistaken in my summary.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-73909944 @miguno, I am sure you are aware that you can still send data to this channel even if channel.isWritable returns false. Check http://netty.io/4.0/api/io/netty/channel/Channel.html#isWritable()

```
boolean isWritable()
Returns true if and only if the I/O thread will perform the requested write operation immediately.
Any write requests made when this method returns false are queued until the I/O thread is ready to process the queued write requests.
```

Doing or not doing the isWritable check has nothing to do with message loss; the check is purely a performance optimization for small messages.

Case 1: if we have a large enough batch, we just flush it to Netty's internal queue. Netty will flush the pending data in its queue to the wire when the wire is not busy.

Case 2: if we don't have a large enough batch and the wire is busy, the Storm Netty Client will wait for a while, buffer more messages into the batch, and set a timer to flush later.

Case 3: if we don't have a large enough batch and the wire is free, the Storm Client will flush immediately, so that we get better latency.
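A self-contained sketch of those three cases; hedged, since the class, batch size, and timer interval below are illustrative stand-ins, not Storm's actual Client code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchFlushSketch {
    private static final int BATCH_SIZE_BYTES = 262144; // illustrative threshold
    private static final long FLUSH_INTERVAL_MS = 5;    // illustrative timer

    private final List<byte[]> batch = new ArrayList<>();
    private int batchBytes = 0;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public synchronized void send(byte[] message, boolean wireIsWritable) {
        batch.add(message);
        batchBytes += message.length;
        if (batchBytes >= BATCH_SIZE_BYTES) {
            flush(); // case 1: batch is big enough; Netty queues it if the wire is busy
        } else if (wireIsWritable) {
            flush(); // case 3: wire is free, flush immediately for latency
        } else {
            // case 2: wire busy and batch small: keep buffering, flush on a timer
            timer.schedule(this::flushIfPending, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);
        }
    }

    private synchronized void flushIfPending() {
        if (!batch.isEmpty()) flush();
    }

    private synchronized void flush() {
        // stand-in for handing the batch to the channel, e.g. channel.write(batch)
        System.out.println("flushing " + batch.size() + " messages");
        batch.clear();
        batchBytes = 0;
    }
}
```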
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-73910187 So I don't think your TODO comment is an issue; it is actually designed like this. What do you think?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73925561 Something like `git rebase -i upstream/master`
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73932513 I see; there are multiple remote merges, which make a rebase impossible. How about creating a patch file against master and then applying the patch with a new commit log?
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-73651894 +1 on the update; the patch is well written. I made a few comments under the review panel of https://github.com/miguno/storm/commit/8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3#diff-e1bd524877b15ccf409f846e3c95da13R203
[GitHub] storm pull request: STORM-188: Allow user to specify full configu...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/120#issuecomment-69435220 @revans, feel free to do what you want: change it or replace it. :)
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-67047625 @nathanmarz, I'd like to explain why I need to change worker.clj. This was also motivated by a legacy TODO in zmq.clj. https://github.com/nathanmarz/storm/blob/0.8.2/src/clj/backtype/storm/messaging/zmq.clj#L43

```
(send [this task message]
  ...
  (mq/send socket message ZMQ/NOBLOCK)) ;; TODO: how to do backpressure if doing noblock?... need to only unblock if the target disappears
```

As we can see, the ZeroMQ transport sends messages in a non-blocking way. If I understand this TODO correctly, it wants:

a) When the target worker is not booted yet, the source worker should not send messages to the target. Otherwise, as there is no backpressure, there will be message loss during the boot-up phase. If it is an unacked topology, the message loss is permanent; if it is an acked topology, we will need to do unnecessary replay.

b) When the target worker disappears in the middle (a crash?), the source worker should drop the messages directly.

The problem is that the transport layer doesn't know by itself whether the target worker is still booting up or has crashed during the running phase, so it cannot smartly choose between backpressure and dropping. If the transport simply chooses to block, that is good for the boot-up phase but bad for the running phase: if one connection is down, it may block messages sent over other connections. If the transport simply chooses to drop, that is good for the running phase but bad for the boot-up phase: if the target worker boots 30 seconds later, all messages in those 30 seconds are dropped.

The changes in worker.clj are targeted at solving this problem. The worker knows when the target worker connections are ready. In the boot-up phase, worker.clj waits until the target worker connections are ready and only then activates the source worker tasks (see the sketch below). In the runtime phase, the transport simply drops the messages if the target worker crashed in the middle.

There are several benefits:
1. During cluster boot-up, for an unacked topology, there is no strange message loss.
2. During cluster boot-up, an acked topology takes less time to reach its normal throughput, as there is no message loss, timeout, or replay.
3. For the transport layer, the design is simplified: we can just drop the messages if the target worker is not available.
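A minimal sketch of that boot-up gate, assuming a polled per-connection state; the real logic lives in worker.clj, and `ConnectionState` plus the poll interval here are illustrative.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class ConnectionReadyGate {
    enum ConnectionState { CONNECTING, READY, CLOSED }

    // Block until every outbound worker connection reports READY, so spouts
    // and bolts are only activated once sends can no longer be silently dropped.
    static void awaitAllConnectionsReady(Map<String, ConnectionState> connections)
            throws InterruptedException {
        while (connections.values().stream()
                .anyMatch(s -> s != ConnectionState.READY)) {
            TimeUnit.MILLISECONDS.sleep(100);
        }
    }
}
```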
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-66620738 @tedxia Thanks, I think this will work.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-66620748 +1
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-65763659 Thanks, Ted! Yes, we need a fine-grained lock at `synchronized connect()`. I see you changed it to use a scheduled Runnable; that may cause a deadlock, as schedule is a blocking operation: connect() waits for threadpool.schedule() to release slots, while threadpool.schedule() waits for connect() to exit. Looking forward to your update on this.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-63443152 It seems common for Storm UTs to have random failures. We may need to clean up the Storm UTs to make them faster and more robust.
[GitHub] storm pull request: STORM-188: Allow user to specifiy full configu...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/120#issuecomment-63431309 The original behavior of findAndReadConfigFile() is to locate the config file on the classpath. findResources(name) will not be empty when name exists on the classpath.
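A hedged sketch of what a classpath lookup like findResources(name) boils down to, assuming it delegates to the context ClassLoader; Storm's actual implementation may differ in details.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class FindResourcesSketch {
    // Collect every classpath match for the given resource name.
    static List<URL> findResources(String name) throws IOException {
        return Collections.list(
                Thread.currentThread().getContextClassLoader().getResources(name));
    }

    public static void main(String[] args) throws IOException {
        // Non-empty only when "storm.yaml" is actually on the classpath
        List<URL> found = findResources("storm.yaml");
        System.out.println(found.isEmpty() ? "not on classpath" : found.toString());
    }
}
```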
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-62509705 @tedxia The thread pool size of clientScheduleService is determined by the worker number (bounded to >= 1 and <= 10). For example, if there are 2 workers, the pool size is 1; if the worker number is 4, the pool size is 3. A sketch of the rule follows.
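The exact formula below is an inference from the examples given above, not quoted source.

```java
public class PoolSizeSketch {
    // Pool size tracks the number of other workers, clamped to [1, 10].
    static int clientSchedulePoolSize(int workerCount) {
        return Math.max(1, Math.min(10, workerCount - 1));
    }

    public static void main(String[] args) {
        System.out.println(clientSchedulePoolSize(2));  // 1
        System.out.println(clientSchedulePoolSize(4));  // 3
        System.out.println(clientSchedulePoolSize(50)); // capped at 10
    }
}
```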
[GitHub] storm pull request: STORM-550: fix bug get an error when use --conf...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/310#issuecomment-62510083 This seems to be a duplicate of STORM-188.
[GitHub] storm pull request: [STORM-537] A worker reconnects infinitely to ...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/304#issuecomment-61921292 +1
[GitHub] storm pull request: [STORM-537] A worker reconnects infinitely to ...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/304#issuecomment-61605751 @Sergeant007, nice find about the `channel = null`. I am a little scared by the complexity of the test code; I am +1 if you: 1. add a comment on top of ```channel = null;```, such as 'set it to null to make sure we throw when the reconnection fails' 2. remove the complex test or replace it with a clean one.
[GitHub] storm pull request: [STORM-537] A worker reconnects infinitely to ...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/304#issuecomment-61605948 Also, can you remove the brackets [] around the STORM ID? GitHub cannot link to Apache JIRA if you have them.
[GitHub] storm pull request: STORM-456:Storm UI cannot navigate to topology...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/303#issuecomment-61631456 @NareshKosgi, have you considered other special characters that may be used in a topology name, like tab, #, (, etc.? Is it possible that this fix only works for space but fails for tab? Another approach is to add strict name checking when submitting a topology (see the sketch below); that may make life easier.
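A hedged sketch of that "strict name checking at submit time" idea; the allowed character set is an assumption for illustration, not Storm's actual rule.

```java
import java.util.regex.Pattern;

public class TopologyNameValidator {
    // Assumed whitelist: letters, digits, underscore, hyphen.
    private static final Pattern VALID = Pattern.compile("[A-Za-z0-9_-]+");

    static void validate(String topologyName) {
        if (topologyName == null || !VALID.matcher(topologyName).matches()) {
            throw new IllegalArgumentException(
                    "Topology name must match [A-Za-z0-9_-]+ but was: " + topologyName);
        }
    }

    public static void main(String[] args) {
        validate("word-count");    // ok
        validate("word count #1"); // throws: space and '#' rejected at submit time
    }
}
```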
[GitHub] storm pull request: [STORM-442] multilang ShellBolt/ShellSpout die...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/305#issuecomment-61631943 @itaifrenkel, ``` When you say the parent process fails do you mean that the worker process is no longer running? ``` No, it means the case described by dashengju: an exception is thrown in the Java space (not originating from the subprocess).
[GitHub] storm pull request: STORM-378,SleepSpoutWaitStrategy.emptyEmit sho...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/295#issuecomment-61634029 To make the whole topology responsive, the spout needs to stay active to pull data frequently from the acker or system tick. Setting topology.sleep.spout.wait.strategy.time.ms to 1 ms should be good enough; the resulting system load is relatively small. What is the motivation for making the sleep time increase?
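For contrast, a hedged sketch of what an increasing-sleep strategy might look like, assuming the ISpoutWaitStrategy interface exposes prepare(Map) and emptyEmit(long streak); the doubling rule and cap are made up for illustration, not the actual patch.

```java
import java.util.Map;

import backtype.storm.spout.ISpoutWaitStrategy;

public class BackoffSpoutWaitStrategy implements ISpoutWaitStrategy {
    private long baseSleepMs = 1;
    private long maxSleepMs = 100;

    @Override
    public void prepare(Map conf) {
        // could read topology.sleep.spout.wait.strategy.time.ms here
    }

    @Override
    public void emptyEmit(long streak) {
        // Sleep longer the more consecutive empty emits we have seen,
        // capped so the spout stays responsive to acks and system ticks.
        long sleep = Math.min(maxSleepMs, baseSleepMs << Math.min(streak, 6));
        try {
            Thread.sleep(sleep);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```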
[GitHub] storm pull request: STORM-493: Workers inherit storm.conf.file/sto...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/252#issuecomment-61639159 +1
[GitHub] storm pull request: STORM-497: don't modify the mapping while the ...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/257#issuecomment-61579721 @HeartSaVioR, the performance is not a concern, since taskToQueueId will only be modified a few times. +1 for the patch.
[GitHub] storm pull request: STORM-188: Allow user to specify full configu...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/120#issuecomment-61599764 @d2r, sorry it took so long. The patch is now synced with upstream.
[GitHub] storm pull request: Update README.md
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/292#issuecomment-61600425 +1
[GitHub] storm pull request: STORM-253: Allow storm worker to use dynamic p...
GitHub user clockfly reopened a pull request: https://github.com/apache/storm/pull/45 STORM-253: Allow storm worker to use dynamic port instead of pre-configed. Add a ne... PR for [STORM-253](https://issues.apache.org/jira/browse/STORM-253): allow the supervisor to assign worker ports dynamically instead of pre-configuring them in the configuration file. This is especially useful in a shared cluster environment. Changes: 1. Add a new config supervisor.slots to set the number of workers per supervisor. 2. Add a ServerSocketFactory class to reserve the port for a worker. You can merge this pull request into a Git repository by running: $ git pull https://github.com/clockfly/incubator-storm worker-dynamic-port Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/45.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #45 commit 5350a04e47deaa71890708c8719998a930e7446d Author: Sean Zhong clock...@gmail.com Date: 2014-03-06T03:36:21Z Allow storm worker to use dynamic port instead of pre-configed. Add a new config like supervisor.slots: 3 to config the number of workers.
[GitHub] storm pull request: Update README.md
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/292#issuecomment-61601001 @lukedemi Could you please add the JIRA number to the title of the pull request?
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61410622 Hi HeartSaVioR, for an acked topology there is an at-least-once delivery guarantee. When a tuple is dropped, the tuple cached on the spout side will time out and report a failure to the Storm UI, where it shows up in the failed count.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61373371 I traced back the message-loss bug and found the issue was introduced by STORM-350 (https://github.com/apache/storm/pull/134/files#diff-4cbcb6fa47274d6cf66d70585f98cbefR202), which upgraded Disruptor from 2.10.1 to 3.2.1.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61392411 When I reverted STORM-350 and tested again, there was no more message loss.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61393103

High Availability test
===

Test scenario: 4 machines A, B, C, D; 4 workers, 1 worker on each machine.

Test case 1 (STORM-404): on machine A, kill the worker. A will create a new worker taking the same port.
Expected result: reconnection will succeed.
Experiment result: the other workers start to reconnect and eventually succeed, because A starts a new worker on the same port.

```
2014-11-02T09:31:24.988+0800 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-IDHV22-04/192.168.1.54:6703... [84]
2014-11-02T09:31:25.498+0800 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-IDHV22-04/192.168.1.54:6703... [85]
2014-11-02T09:31:25.498+0800 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-IDHV22-04/192.168.1.54:6703, [id: 0x54466bab, /192.168.1.51:51336 => IDHV22-04/192.168.1.54:6703]
```

Test case 2 (STORM-404): on machine A, kill the worker, then immediately start a process to occupy the port used by the worker, which forces Storm to relocate the worker to a new port (or a new machine).
Expected result: the reconnection attempt fails, because Storm relocated the worker to a new port.
Actual result: first, after many reconnection tries, the reconnection is aborted and no exception is thrown:

```
2014-11-02T09:31:14.753+0800 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-IDHV22-04/192.168.1.54:6703... [63]
2014-11-02T09:31:18.065+0800 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-IDHV22-04/192.168.1.54:6703... [70]
at org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71) ~[storm-core-0.9.3-rc2-SNAPSHOT.jar:0.9.3-rc2-SNAPSHOT]
...
2014-11-02T09:45:36.209+0800 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-IDHV22-04/192.168.1.54:6703..., timeout: 60ms, pendings: 0
2014-11-02T09:45:36.209+0800 b.s.m.n.Client [INFO] connection is closing, abort reconnecting...
```

Second, a new connection is established to the new worker (with a new port, or on another machine); previously the worker was at IDHV22-04:6703, then it relocated to IDHV22-03:6702:

```
2014-11-02T09:45:36.206+0800 b.s.m.n.Client [INFO] New Netty Client, connect to IDHV22-03, 6702, config: , buffer_size: 5242880
2014-11-02T09:45:36.207+0800 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-IDHV22-03/192.168.1.53:6702, [id: 0x538fdacb, /192.168.1.51:56047 => IDHV22-03/192.168.1.53:6702]
```

Test case 3: check the failed message count before and after the worker crash.
Expected result: after the worker crash there will be some message loss; after things stabilize, the message loss will not increase.
Actual result: meets expectation.

Test case 4: check the throughput change before and after the worker crash.
Expected result: there should be no performance drop.
Actual result: when Storm starts the new worker on the same machine, there is no performance drop; see the first gap in the following image.

![network bandwidth change before and after worker crash](https://issues.apache.org/jira/secure/attachment/12678758/worker-kill-recover3.jpg)

When Storm starts the new worker on a different machine, it may impact the parallelism; see the second gap in the picture above. Before the worker crash there were 4 workers on 4 machines; after the crash there are 3 workers on 4 machines. The parallelism drops, so the throughput drops.

Test case 5 (STORM-510): when a target worker crashes, messages sent to other workers should not be blocked.
Expected result: one connection should not block another in the case of a worker crash.
Actual result: in the code, the blocking logic is removed, so one connection will not block another. However, during the transition period of the failure, because there is a lot of message loss to the crashed worker, the max.spout.pending flow control may kick in; the spout's sending speed becomes slower, and the overall max throughput is smaller. After the transition, it goes back to normal. In my test, the transition period was around 40 seconds.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61393801 Summary of the test:
- UT: pass
- STORM-404: pass
- STORM-510: pass
- Performance regression: pass
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61393817 Hi Ted, can you try this on your live cluster and contribute some real-world test results?
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61286844

About the performance test:
===
I tested the performance of the new patch. It shows no significant difference from Storm 0.9.2.

About the STORM-404 chained-crash issue (one worker causing another worker to crash):
===
With this patch, the reconnection is successfully aborted, and a new connection is established.

```
2014-10-31T23:00:11.738+0800 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-IDHV22-04/192.168.1.54:6703... [30]
2014-10-31T23:00:12.738+0800 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-IDHV22-04/192.168.1.54:6703
2014-10-31T23:00:12.739+0800 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-IDHV22-04/192.168.1.54:6703..., timeout: 60ms, pendings: 0
2014-10-31T23:00:32.754+0800 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30). Pinning to 29
2014-10-31T23:00:32.754+0800 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]
2014-10-31T23:00:32.754+0800 b.s.m.n.Client [INFO] New Netty Client, connect to IDHV22-01, 6702, config: , buffer_size: 5242880
2014-10-31T23:00:32.754+0800 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-IDHV22-01/192.168.1.51:6702... [0]
2014-10-31T23:00:32.755+0800 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-IDHV22-01/192.168.1.51:6702, [id: 0x4f7eb44b, /192.168.1.51:56592 => IDHV22-01/192.168.1.51:6702]
```
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-61287291 During this test, I found there was message loss, but it was not caused by this new patch. I traced back; it seems the fault was introduced by changes between 0.9.2 and 0.9.3-rc1. I am still trying to find the root cause of this error.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-60870953 @tedxia I got a chance to chat with Ted online. In summary, he is describing the following case (worker A -> worker B):

1. B dies.
2. After the ZK session timeout, ZK knows B is dead.
3. A initiates the reconnection process to B. By default, it will retry 300 times at most (which should take longer than 120 seconds, based on the comment in the config):
```
# Since nimbus.task.launch.secs and supervisor.worker.start.timeout.secs are 120, other workers should also wait at least that long before giving up on connecting to the other worker.
```
4. ZK is under heavy load (consider a ZK tree with 100 thousand nodes and many, many watchers) and may take minutes to notify A that B is dead.
5. A doesn't get the notification from ZK in time; after 300 connection retries the reconnection fails and A throws, which causes the worker to exit.

Basically there are two questions. First, can we ensure that ZooKeeper is responsive (< 1 minute)? Second, if a worker doesn't get an update about B from ZooKeeper after 300 reconnection retries, should we exit the worker or let it continue working?
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-60871259

```
If a worker doesn't get an update about B from ZooKeeper after 300 reconnection retries, should we exit the worker or let it continue working?
```

The current approach is:
- If worker A gets an update about B from ZK, it aborts the reconnection, and worker A stays alive and working.
- If A doesn't get an update from ZK that B is dead after the 300 reconnection retries time out, then A exits.

In my opinion, there is no way to recover except to exit, because a) A believes it must have a live connection to B, since the application tells it that, and b) A cannot set up a connection to B after exhausting every effort. (A sketch of this policy follows.)
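A hedged sketch of that retry-then-exit policy: the helper names, jitter, and backoff numbers are illustrative (the logs earlier in this thread show baseSleepTimeMs 100 and maxSleepTimeMs 1000), not Storm's actual Client code.

```java
import java.util.Random;
import java.util.function.BooleanSupplier;

public class ReconnectSketch {
    static final int MAX_RETRIES = 300;
    static final long BASE_SLEEP_MS = 100;
    static final long MAX_SLEEP_MS = 1000;

    // Returns true if reconnected, false if ZK reported the peer gone (abort
    // quietly), and throws if the peer is still assigned but unreachable.
    static boolean reconnect(BooleanSupplier peerStillAssigned, BooleanSupplier tryConnect)
            throws InterruptedException {
        Random random = new Random();
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            if (!peerStillAssigned.getAsBoolean()) {
                return false;       // ZK says B is dead: abort the reconnection
            }
            if (tryConnect.getAsBoolean()) {
                return true;        // connection re-established
            }
            // bounded exponential backoff with a little jitter
            long backoff = Math.min(MAX_SLEEP_MS,
                    BASE_SLEEP_MS * (1L << Math.min(attempt, 10)));
            Thread.sleep(backoff + random.nextInt(100));
        }
        throw new RuntimeException(
                "Remote worker unreachable after " + MAX_RETRIES + " retries");
    }
}
```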
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/storm/pull/268#discussion_r19519682

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---

```
@@ -378,9 +392,15 @@
   _ (refresh-connections nil)
   _ (refresh-storm-active worker nil)
+
+  receive-thread-shutdown (launch-receive-thread worker)
+
+  ;; make sure all messaging connections are ready for sending data, the netty messaging
+  ;; client will drop the messages if client.send(msg) is called before the connection
+  ;; is established.
+  _ (wait-messaging-connections-to-be-ready worker)
```

--- End diff --

The wait here solves a corner case when the topology boots up: the previous approach started the spout immediately, which used the IConnection layer before the IConnection was available. wait-messaging-connections-to-be-ready ensures that the connection layer is ready before the spout starts.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/storm/pull/268#discussion_r19065812

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---

```
@@ -330,6 +330,20 @@
     (.send drainer node+port-socket)))
   (.clear drainer))

+;; Check whether this messaging connection is ready to send data
+(defn is-connection-ready [^IConnection connection]
+  (if (instance? ConnectionWithStatus connection)
```

--- End diff --

Here we have a type check (instance?), which may not be clean. Changing IConnection itself might be cleaner, but that could introduce compatibility issues (suppose a user has already implemented custom messaging against the current IConnection declaration).

If the messaging connection extends ConnectionWithStatus, we check its status. If the messaging connection does NOT extend ConnectionWithStatus, we behave the same as before (we skip this check). A Java rendering of the idea is sketched below.
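A hedged Java rendering of the same check, assuming ConnectionWithStatus exposes a status() method with a Ready state, as the Clojure code implies; the exact enum names are an assumption.

```java
import backtype.storm.messaging.ConnectionWithStatus;
import backtype.storm.messaging.IConnection;

public class ConnectionReadiness {
    // Only connections that opt in to ConnectionWithStatus are status-checked;
    // everything else is treated as ready, preserving the old behavior for
    // custom transports that implement the plain IConnection interface.
    static boolean isConnectionReady(IConnection connection) {
        if (connection instanceof ConnectionWithStatus) {
            return ((ConnectionWithStatus) connection).status()
                    == ConnectionWithStatus.Status.Ready;
        }
        return true; // legacy IConnection: skip the check, as before
    }
}
```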
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-59487520 Suppose we are sending data from worker A to worker B. To solve STORM-404 (a worker on one machine crashes due to a failure of another worker on another machine), I think we can adopt the following logic:

Case 1: B is down.
1. B is lost, but A still believes B is alive.
2. A tries to send data to B, which triggers a reconnect.
3. Nimbus finds B is lost and notifies A.
4. A gets the notification that B is down and interrupts the reconnection of step 2 (by closing the connection).
5. The reconnection of step 2 is interrupted and exits; it will not throw a RuntimeException.

The key change is at step 4: A needs to interrupt the reconnection to an obsolete worker.

Case 2: B is alive, but the connection from A to B is down.
1. A triggers the reconnection logic.
2. The reconnection times out.
3. A cannot handle this failure, so A throws a RuntimeException.