[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-76072022 Patch ported and merged to 0.9.x branch. Benchmarked several topologies (both core and Trident) before (0.9.3) and after (0.9.4-SNAPSHOT) this patch and found no performance regressions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user 3in commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-76130337 @ptgoetz "If you have the resources, testing an RC and reporting back is the best way to accelerate a release." Where do I get the RC? I will give it a try. https://github.com/ptgoetz/storm/tree/storm-329-merge does not seem to exist. Thanks
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75713928 Oh, Taylor. Could you also update STORM-404 and STORM-510 as appropriate?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75891084 @danielschonfeld Soon. ;) We are probably a handful of weeks out, hopefully less. If you have the resources, testing an RC and reporting back is the best way to accelerate a release.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75878755 @ptgoetz Sorry for being the newbie here, but does this mean a new jar will be built off 0.9.3 that will include this, or will we have to wait for 0.10.0? We could really use this fix.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75867311 I finally got this successfully backported to the 0.9.x branch, with all tests passing. I will merge it soon, after more testing, and update all associated JIRAs. Thanks for all the effort that went into this!
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75879324 @miguno Yep, that's what I meant.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75881077 @ptgoetz What is the time frame for 0.9.4? (Roughly.)
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75879629 @danielschonfeld No worries. There will be a 0.9.4 release that includes this fix. No need to wait for 0.10.0.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75634123 Phew. :-) Thanks for merging, Taylor! On 23.02.2015, at 22:06, P. Taylor Goetz (notificati...@github.com) wrote: "Disregard last message. It was a merge mistake (picked right when I should have picked left). All tests are passing now."
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user asfgit closed the pull request at: https://github.com/apache/storm/pull/429
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75631663 Disregard last message. It was a merge mistake (picked right when I should have picked left). All tests are passing now.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user nathanmarz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74914806 Nimbus only knows a worker is having trouble when it stops sending heartbeats. If a worker gets into a bad state, the worst thing to do is have it continue trying to limp along in that bad state. It should instead suicide as quickly as possible. It seems counterintuitive, but this aggressive suiciding behavior actually makes things more robust, as it prevents processes from getting into weird, potentially undefined states. This has been a crucial design principle in Storm from the beginning. One consequence of it is that any crucial system thread that receives an unrecoverable exception must suicide the process rather than die quietly. For the connection retry problem, it's a tricky situation, since the client may not be able to connect because the other worker is still getting set up. So the retry policy should somehow be related to the launch timeouts for worker processes specified in the configuration. Not being able to connect after the launch timeout + a certain number of attempts + a buffer period would certainly qualify as a weird state, so the process should suicide in that case. *Suiciding and restarting gets the worker back to a known state.* So in this case, I am heavily in favor of Option 2. I don't care about killing the other tasks in the worker, because this is a rare situation. It is infinitely more important to get the worker back to a known, robust state than to risk leaving it in a weird state permanently. I would like to see these issues addressed as part of this patch. @miguno Thanks for the explanation on this patch's relation to backpressure; we'll handle that in a future patch.
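Nathan's suggestion above, tying the retry policy to the configured worker launch timeout and suiciding once "launch timeout + a certain number of attempts + a buffer period" has elapsed, can be sketched roughly as follows. This is a hypothetical, simplified illustration, not Storm's actual implementation; the class name, method names, and parameters are all made up for the example.

```java
// Hypothetical sketch of a reconnect policy that decides when a worker
// should "suicide" because the remote peer has had ample time to launch.
// Not Storm's actual code.
public class ReconnectPolicy {
    private final long baseSleepMs;      // initial backoff sleep
    private final long maxSleepMs;       // cap on the backoff sleep
    private final long launchTimeoutMs;  // configured worker launch timeout
    private final long bufferMs;         // extra grace period on top of the timeout
    private final int maxAttempts;       // bounded number of reconnection attempts

    public ReconnectPolicy(long baseSleepMs, long maxSleepMs,
                           long launchTimeoutMs, long bufferMs, int maxAttempts) {
        this.baseSleepMs = baseSleepMs;
        this.maxSleepMs = maxSleepMs;
        this.launchTimeoutMs = launchTimeoutMs;
        this.bufferMs = bufferMs;
        this.maxAttempts = maxAttempts;
    }

    /** Exponential backoff for the given attempt number, capped at maxSleepMs. */
    public long sleepMs(int attempt) {
        long exp = baseSleepMs << Math.min(attempt, 20);
        return Math.min(exp, maxSleepMs);
    }

    /**
     * True once the peer has had launch timeout + bounded attempts + buffer
     * to come up; at that point restarting is the only way back to a known state.
     */
    public boolean shouldSuicide(long elapsedMs, int attempts) {
        return elapsedMs > launchTimeoutMs + bufferMs && attempts > maxAttempts;
    }
}
```

The key design point is that neither condition alone triggers the suicide: many attempts during the launch window are expected, and a long elapsed time with few attempts just means the backoff sleeps were long.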
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74927415 @nathanmarz Thanks for the detailed feedback on the max-retries issue. As Bobby suggested, would you mind if we decouple the work on max-retries (tracked at STORM-677) from this pull request / STORM-329? The max-retries issue has been a problem in Storm for a while, since before this pull request. Decoupling would also help us with backtracking any future issues to either ticket. Right now, I feel there's almost too much meat in this pull request already (as we kinda covered STORM-329, STORM-404, and STORM-510 in one big swing). Would that be ok for you?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74894542 FYI: I created [STORM-677: Maximum retries strategy may cause data loss](https://issues.apache.org/jira/browse/STORM-677) to address the issue that Bobby brought up in this discussion.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74895041 PS: We may also want to update the original [STORM-329](https://issues.apache.org/jira/browse/STORM-329) ticket description to reflect the changes in this PR.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user ptgoetz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74885134 +1 I was able to verify the fix, and am in favor of merging. I'd also like to apply it to the 0.9.x branch as I feel it's an important fix.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74898690 Thanks, Taylor! Let me know if I can help with sorting out the test failures. Also regarding JIRA: I forgot to mention that it looks like we need to update STORM-404 and STORM-510 as well, as STORM-329 has been said to cover those issues, too. It looks to me like this is actually the case, but I'd prefer another pair of eyes to be sure we're not mistakenly closing two related tickets.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74702215 And FWIW, with the code in this PR the total test suite takes about 5 minutes to complete.

```
$ mvn clean install
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Storm ............................................. SUCCESS [  1.696 s]
[INFO] maven-shade-clojure-transformer ................... SUCCESS [  1.930 s]
[INFO] storm-maven-plugins ............................... SUCCESS [  2.033 s]
[INFO] Storm Core ........................................ SUCCESS [03:56 min]
[INFO] storm-starter ..................................... SUCCESS [  6.616 s]
[INFO] storm-kafka ....................................... SUCCESS [ 48.469 s]
[INFO] storm-hdfs ........................................ SUCCESS [  1.875 s]
[INFO] storm-hbase ....................................... SUCCESS [  2.271 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:02 min
[INFO] Finished at: 2015-02-17T17:49:19+01:00
[INFO] Final Memory: 67M/282M
[INFO] ------------------------------------------------------------------------
```
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user revans2 commented on a diff in the pull request: https://github.com/apache/storm/pull/429#discussion_r24824540

--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---

```diff
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;

 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;

-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at the expense of network latency. The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ * - Note: The current implementation drops any messages that are being enqueued for sending if the connection to
+ *   the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background. It will, at fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements IStatefulObject {
+
     private static final Logger LOG = LoggerFactory.getLogger(Client.class);
     private static final String PREFIX = "Netty-Client-";
-    private final int max_retries;
-    private final int base_sleep_ms;
-    private final int max_sleep_ms;
+    private static final long NO_DELAY_MS = 0L;
+    private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+    private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+    private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+    private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
     private final StormBoundedExponentialBackoffRetry retryPolicy;
-    private AtomicReference<Channel> channelRef;
     private final ClientBootstrap bootstrap;
-    private InetSocketAddress remote_addr;
-
-    private AtomicInteger totalReconnects;
-    private AtomicInteger messagesSent;
-    private AtomicInteger messagesLostReconnect;
-    private final Random random = new Random();
-    private final ChannelFactory factory;
-    private final int buffer_size;
-    private boolean closing;
-
-    private int messageBatchSize;
-
-    private AtomicLong pendings;
-
-    Map storm_conf;
+    private final InetSocketAddress dstAddress;
+    protected final String dstAddressPrefixedName;
+
+    /**
+     * The channel used for all write operations from this client to the remote destination.
+     */
+    private final AtomicReference<Channel> channelRef = new AtomicReference<Channel>(null);
+
+    /**
+     * Maximum number of reconnection attempts we will perform after a disconnect before giving up.
+     */
+    private final int maxReconnectionAttempts;
```

--- End diff --

I personally prefer option 3: no maximum number of reconnection attempts. Having the client decide that it is done before nimbus does feels like it is asking for trouble.
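The diff under discussion has `Client` extend `ConnectionWithStatus`, which is what lets the rest of the worker observe whether a connection is still establishing, usable, or closed. A rough, self-contained sketch of that idea is below; the class is a simplified stand-in (the `Object` channel placeholder and method names are invented for the example), not the exact Storm API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of a connection that exposes its lifecycle status,
// in the spirit of backtype.storm.messaging.ConnectionWithStatus.
public class StatusfulConnection {
    public enum Status { Connecting, Ready, Closed }

    // Placeholder for the Netty Channel; null means "no live channel yet".
    private final AtomicReference<Object> channelRef = new AtomicReference<Object>(null);
    private volatile boolean closing = false;

    public Status status() {
        if (closing) {
            return Status.Closed;
        }
        // Ready only once a live channel has been established.
        return channelRef.get() != null ? Status.Ready : Status.Connecting;
    }

    public void onChannelConnected(Object channel) { channelRef.set(channel); }

    public void onChannelLost() { channelRef.set(null); }  // back to Connecting

    public void close() { closing = true; channelRef.set(null); }
}
```

Deriving the status from the channel reference rather than storing it separately keeps the two from ever disagreeing, which matters given that connects and disconnects happen asynchronously.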
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74701138 "I am seeing a lot of tests timing out with this change. Has anyone else seen this?" Hmm. All the tests are passing for me (and they have been for a while). Do you have any pointers?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74702976 I don't know, it could just be my Mac acting funny. I'll dig into it, but don't block the pull request on me. I saw similar things on a different pull request that corrected themselves spontaneously. My guess right now is that it could have something to do with being on battery power vs. being plugged into the wall.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74696132 I am +1 for the code as is. Perhaps a separate JIRA for the reconnection attempts would be best. The issue has been in for quite a while now, and this code was not intended to fix it to begin with.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74707726 OK, it is something odd with my Mac. It looks like the difference is wired vs. wireless networking :), or possibly even switching between the two.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on a diff in the pull request: https://github.com/apache/storm/pull/429#discussion_r24738502

--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---

```diff
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;

 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;

-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at the expense of network latency. The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ * - Note: The current implementation drops any messages that are being enqueued for sending if the connection to
+ *   the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background. It will, at fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements IStatefulObject {
+
     private static final Logger LOG = LoggerFactory.getLogger(Client.class);
     private static final String PREFIX = "Netty-Client-";
-    private final int max_retries;
-    private final int base_sleep_ms;
-    private final int max_sleep_ms;
+    private static final long NO_DELAY_MS = 0L;
+    private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+    private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+    private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+    private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
     private final StormBoundedExponentialBackoffRetry retryPolicy;
-    private AtomicReference<Channel> channelRef;
     private final ClientBootstrap bootstrap;
-    private InetSocketAddress remote_addr;
-
-    private AtomicInteger totalReconnects;
-    private AtomicInteger messagesSent;
-    private AtomicInteger messagesLostReconnect;
-    private final Random random = new Random();
-    private final ChannelFactory factory;
-    private final int buffer_size;
-    private boolean closing;
-
-    private int messageBatchSize;
-
-    private AtomicLong pendings;
-
-    Map storm_conf;
+    private final InetSocketAddress dstAddress;
+    protected final String dstAddressPrefixedName;
+
+    /**
+     * The channel used for all write operations from this client to the remote destination.
+     */
+    private final AtomicReference<Channel> channelRef = new AtomicReference<Channel>(null);
+
+    /**
+     * Maximum number of reconnection attempts we will perform after a disconnect before giving up.
+     */
+    private final int maxReconnectionAttempts;
+
+    /**
+     * Total number of connection attempts.
+     */
+    private final AtomicInteger totalConnectionAttempts = new AtomicInteger(0);
+
+    /**
+     * Number of connection attempts since the last disconnect.
+     */
+    private final AtomicInteger connectionAttempts = new AtomicInteger(0);
+
+    /**
+     * Number of messages successfully sent to the remote destination.
+     */
+    private final AtomicInteger messagesSent = new AtomicInteger(0);
+
+    /**
+     * Number of messages that could not be sent to the remote destination.
+     */
+    private final AtomicInteger messagesLost = new AtomicInteger(0);
+
+    /**
+     * Number of messages buffered in memory.
+     */
+    private final AtomicLong pendingMessages = new AtomicLong(0);
+
+    /**
+     * This flag is set to true if and only if a client instance is being closed.
+     */
+    private volatile boolean closing = false;
+
+    /**
+     * When set to true, then the background flusher thread will flush any pending messages on its next run.
+     */
+    private final AtomicBoolean backgroundFlushingEnabled = new AtomicBoolean(false);
+
+    /**
+     * The absolute time (in ms) when the next background flush should be performed.
+     *
+     * Note: The
```
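The javadoc in the quoted diff describes a background flusher thread that, at fixed intervals, drains buffered messages iff flushing is currently enabled. A minimal, hypothetical sketch of that pattern follows; the deterministic core logic is separated from the scheduler wiring so it can be exercised directly. The class and method names are invented for illustration and this is not Storm's actual code.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical background flusher: on each tick, drain any pending
// messages iff background flushing is currently enabled.
public class BackgroundFlusher {
    private final Queue<String> pending = new ArrayDeque<String>();
    private final AtomicBoolean enabled = new AtomicBoolean(false);
    private int flushedCount = 0;

    public synchronized void enqueue(String msg) { pending.add(msg); }

    /** Core logic the scheduler invokes; returns how many messages were flushed. */
    public synchronized int flushIfEnabled() {
        if (!enabled.get()) {
            return 0;
        }
        int n = pending.size();
        pending.clear();           // stand-in for writing the batch to the channel
        flushedCount += n;
        return n;
    }

    public void setEnabled(boolean on) { enabled.set(on); }

    public synchronized int totalFlushed() { return flushedCount; }

    /** Wire the core logic to a fixed 1s interval, as in the quoted javadoc. */
    public ScheduledExecutorService start() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(this::flushIfEnabled, 1000, 1000, TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

Keeping the flush behind an enabled flag, rather than stopping and restarting the thread, is what lets the owner toggle flushing cheaply around connects and disconnects.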
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user revans2 commented on a diff in the pull request: https://github.com/apache/storm/pull/429#discussion_r24680701 --- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74130344 Doesn't dropping the messages coming from a spout that doesn't care about ack/fail negate Storm's "at least once" attempt? I mean, doesn't that kind of force you to make all your spouts ack/fail aware, where before you could have gotten away without it? In other words: there is a chance that, if the worker that died is the one containing the spout and the first bolt is located on another worker, at-least-once technically wasn't attempted; the messages fell to the floor right away.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74147926 If you need at-least-once processing you must use an acking topology, which will allow Storm to replay lost messages. If instead you go with an unacking topology (= no guaranteed message processing), then you may run into data loss. There are pros and cons for each variant, and e.g. in our case we use both, depending on the use case. Also: the semantics described above have been in Storm right from the beginning. None of them have been changed by this pull request. On 12.02.2015, at 20:01, Daniel Schonfeld (notificati...@github.com) wrote: "Doesn't dropping the messages coming from a non ack/fail caring spout negate the 'at least once' attempt of storm? I mean doesn't that kinda force you to make all your spouts ack/fail aware where before you could have gotten away without it? In other words. There is a chance that if the worker that died is the one containing the spout and if the first bolt is located on another worker, that technically at-least once wasn't tried but rather fell to the floor right away."
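The at-least-once contract miguno describes hinges on the spout tracking in-flight tuples by message ID and re-emitting them on fail; an unacking spout never learns that a downstream tuple was dropped, so the tuple is simply lost. A toy model of that bookkeeping is below. The class and methods are invented for illustration (real Storm spouts implement `ack`/`fail` callbacks on the spout interface), and it is single-threaded for simplicity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model of acking-spout bookkeeping: tuples stay "pending" until acked,
// and failed tuples are queued for replay. Without this tracking (an unacked
// topology), a dropped tuple is gone for good.
public class AckingSpoutState {
    private final Map<Long, String> pending = new HashMap<Long, String>();
    private final Deque<String> replayQueue = new ArrayDeque<String>();
    private long nextId = 0;

    /** Emit a tuple anchored to a fresh message ID. */
    public long emit(String tuple) {
        long id = nextId++;
        pending.put(id, tuple);
        return id;
    }

    /** A downstream ack means the tuple was fully processed; forget it. */
    public void ack(long id) { pending.remove(id); }

    /** A fail (e.g. a dropped message) schedules the tuple for replay. */
    public void fail(long id) {
        String tuple = pending.remove(id);
        if (tuple != null) {
            replayQueue.addLast(tuple);
        }
    }

    /** Next tuple to re-emit, or null if nothing is awaiting replay. */
    public String nextReplay() { return replayQueue.pollFirst(); }

    public int inFlight() { return pending.size(); }
}
```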
Github user nathanmarz commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74036235 I retract my earlier -1. It was mentioned that this enables backpressure for unacked topologies. Is this the case? If so, this is a great new feature for Storm, and tests should be added for this behavior. Namely, they should verify that: - Spouts stop emitting when all buffers fill up - Topology recovers on worker death, and only a subset of messages are dropped in that case
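The first behavior Nathan asks to be tested (spouts stop emitting when all buffers fill up) comes down to bounded buffers. A minimal stand-alone model, not Storm code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Stand-alone model (not Storm code): with a bounded buffer, the producing
// side can no longer emit once the buffer is full -- the essence of the
// "spouts stop emitting when all buffers fill up" behavior.
public class BoundedBufferSketch {
    // Try to emit `attempts` tuples; return how many actually fit.
    public static int emitUntilFull(BlockingQueue<Integer> buffer, int attempts) {
        int emitted = 0;
        for (int i = 0; i < attempts; i++) {
            // offer() returns false instead of blocking when the buffer is
            // full -- the point at which a spout would pause nextTuple().
            if (buffer.offer(i)) emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        int emitted = emitUntilFull(new ArrayBlockingQueue<>(4), 10);
        System.out.println("emitted " + emitted + " of 10 tuples");
    }
}
```

A test along these lines would fill every buffer on the path from spout to bolt and assert that the spout's emit count stops growing until a consumer drains the queue.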
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74084548 +1 On Thu, Feb 12, 2015 at 6:44 PM, Michael G. Noll notificati...@github.com wrote: Thanks for your feedback, Nathan. As far as I understand, this patch does not enable backpressure. But: because there is no backpressure (yet) that we can rely on, this patch at least improves the situation during the startup phase of a topology, ensuring that a) an unacked topology does not lose messages during startup, and b) we do not unnecessarily replay messages in the case of acked topologies during their startup. This is achieved by checking that all worker connections are ready before the topology starts processing data. So backpressure is still an open feature. Backpressure was IIRC mentioned in the initial PR because there was a deficiency (dating back to a ZMQ-related TODO) that caused problems related to this PR/Storm tickets (327, 404, and one more). However, this patch does make the best of the current situation even in the absence of backpressure. But first and foremost this patch fixes a (critical) cascading failure that can bring Storm clusters to a halt. Please correct me if I'm mistaken in my summary. Reply to this email directly or view it on GitHub: https://github.com/apache/storm/pull/429#issuecomment-74050954.
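The startup behavior described above (check that all worker connections are ready before the topology starts processing data) can be modeled roughly as follows; the class name and polling approach are illustrative assumptions, not Storm's actual implementation:

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative only (names and the polling loop are assumptions, not Storm's
// actual implementation): gate topology activation on all outbound worker
// connections reporting ready.
public class StartupGate {
    // Poll every pollMs until all connections report ready; give up after timeoutMs.
    public static boolean awaitAllReady(List<BooleanSupplier> connections,
                                        long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (connections.stream().allMatch(BooleanSupplier::getAsBoolean)) {
                return true;  // safe to start emitting tuples now
            }
            if (System.currentTimeMillis() >= deadline) {
                return false; // startup gate timed out
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The point of such a gate is exactly what the comment describes: an unacked topology does not emit into connections that are not yet established (avoiding startup loss), and an acked topology does not emit tuples that would only time out and be replayed.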
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74032874 This patch allows a worker to properly detect that the connection to a peer becomes unavailable -- for whatever reason (the remote worker is dead or restarting, there was a network glitch, etc). Also, any reconnection attempts are now async so that reconnecting will not block other activities of the worker (such as sending messages to other workers it is still connected to). So to your question: this patch includes the case of the worker that remained alive trying to (re)connect to a dead peer. Does that help?
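A rough stand-alone model of the buffering-plus-async-reconnect behavior described here (class and method names are made up for illustration, not Storm's actual Netty client API): sends never block while the peer is unreachable, and each retry attempt, driven by a timer in the real client, tries to reconnect and flush the backlog.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only -- hypothetical names, not Storm's Netty Client API. While the
// peer is unreachable, send() buffers instead of blocking; each retry "tick"
// (scheduled on a timer in the real, asynchronous client) attempts to
// reconnect and, on success, flushes the buffered messages.
public class ReconnectingClientSketch {
    private boolean connected = false;
    private int retriesLeft;
    private final Queue<String> buffer = new ArrayDeque<>();
    private final Queue<String> wire = new ArrayDeque<>();  // stands in for the network

    public ReconnectingClientSketch(int maxRetries) { this.retriesLeft = maxRetries; }

    // Never blocks: messages are buffered while disconnected, so other work
    // (e.g. sending to healthy peers) is not held up.
    public void send(String msg) {
        if (connected) wire.add(msg); else buffer.add(msg);
    }

    // One retry attempt; gives up for good once maxRetries is exhausted.
    public void tick(boolean peerUp) {
        if (connected || retriesLeft-- <= 0) return;
        if (peerUp) {
            connected = true;
            while (!buffer.isEmpty()) wire.add(buffer.poll());  // flush backlog
        }
    }

    public int delivered() { return wire.size(); }
    public boolean isConnected() { return connected; }
}
```

This covers the direction danielschonfeld asked about: the surviving worker keeps retrying a dead (or restarting) peer without blocking, and delivers the buffered messages once the peer comes back.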
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73925561 Something like `git rebase -i upstream/master`
GitHub user miguno opened a pull request: https://github.com/apache/storm/pull/428 STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages This is an improved version of the original pull request discussed at https://github.com/apache/storm/pull/268. Please refer to the discussion in the link above. **Note**: Please give attribution to @tedxia when merging the pull request as he did a lot (most?) of the work in this pull request. The changes of this pull request include: - Most importantly, we fix a bug in Storm that may cause a cascading failure in a Storm cluster, to the point where the whole cluster becomes unusable. This is achieved by the work described in the next bullet points. - We refactor and improve the Netty messaging backend, notably the client. - During the initial startup of a topology, Storm will now wait until worker (Netty) connections are ready for operation. See the [original discussion thread](https://github.com/apache/storm/pull/268) for the detailed explanation and justification of this change. @clockfly, @tedxia: Please add any further comments to STORM-329 to this pull request, if possible. 
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/428.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #428

- commit 8c52a5b518021a6beff372acbeb66a963a1d4f74 (xiajun xia...@xiaomi.com, 2014-09-24T12:39:18Z): STORM-329: buffer messages in the client and reconnect to the remote server asynchronously
- commit 7cc8c8b1b59d415f1fa54e081127336fcdaeb706 (xiajun xia...@xiaomi.com, 2014-09-26T02:43:55Z): STORM-329: fix continued flushing after the client had been closed
- commit 9826bc9ef7aee8c83e90d56f52ac70ac7165d769 (xiajun xia...@xiaomi.com, 2014-09-30T03:25:08Z): Client does not clean up timed-out TaskMessages
- commit 978c969fdb4b5904b6a87c100fbd80fe26bf39cf (Sean Zhong clock...@gmail.com, 2014-10-20T01:08:43Z): Merge remote-tracking branch 'upstream/master'
- commit 44f8260bbf489c6a2741fb4d8f9196ea6ddb51cc (Sean Zhong clock...@gmail.com, 2014-10-20T01:21:25Z): STORM-404, STORM-510, STORM-329: 1. Break the reconnection to the target if the worker is informed that the target is down, so that we avoid a RuntimeException when reconnection fails. 2. When a worker starts, make sure all target workers are alive before launching spouts. 3. When a target worker is down, all messages sent to it will be dropped.
- commit dea5fbe35c4d9b18a89dae320f9fc985f25bd31a (Sean Zhong clock...@gmail.com, 2014-10-20T01:24:45Z): STORM-329: fix comment grammar
- commit 16de9f3321827624865a4450e93bd53efb75ed93 (Sean Zhong clock...@gmail.com, 2014-10-28T03:31:23Z): test
- commit e8dcf9155c85d3541c9352faf9a0651614b93eb6 (Sean Zhong clock...@gmail.com, 2014-10-29T01:56:19Z): STORM-329: avoid deadlock
- commit 22e7014dedfd580de9dd1d6b2083c8fb3d77d406 (Sean Zhong clock...@gmail.com, 2014-10-29T01:59:09Z): Revert "test" (reverts commit 16de9f3321827624865a4450e93bd53efb75ed93)
- commit ddef6667cdc6de9aebf0d4006ad9e7df2bfbb3bb (Sean Zhong clock...@gmail.com, 2014-10-29T08:14:34Z): Merge remote-tracking branch 'upstream/master'
- commit baf3c628db3899def89ee92c752524d041bc8b40 (Sean Zhong clock...@gmail.com, 2014-10-30T10:17:07Z): STORM-329: fix UT; add a new flag worker-active-flag to the worker data; wait for connections to be ready asynchronously
- commit e1c463f5681dbf6a66c868d467aca14064da1e9b (Sean Zhong clock...@gmail.com, 2014-10-30T10:22:22Z): STORM-329: fix comments. Add a description for storm.messaging.netty.max_retries noting that the reconnection period should also be bigger than storm.zookeeper.session.timeout, so that the reconnection can be aborted (when the target worker is dead) before the retries are exhausted and a RuntimeException is thrown
- commit 2d3fad121481da40258af27e6d7fbcb148365e76 (Sean Zhong clock...@gmail.com, 2014-10-30T10:22:22Z): STORM-329: fix an integration issue
- commit 60f04f9e397cf49e4fba6fe6a2f0bfb23d5a8605 (Sean Zhong clock...@gmail.com, 2014-10-30T10:28:17Z): Merge branch 'master' of https://github.com/tedxia/incubator-storm
- commit 41aafbecac2cf3295255c7dc9b299b8c0c555390 (xiajun xia...@xiaomi.com, 2014-11-18T06:53:14Z): Merge remote-tracking branch 'remotes/apache-storm/0.9.3-branch' into ted-master; conflicts: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java
- commit 2c39866cf8bbca136d3b88f796b9f847b282fdd7 (xiajun
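The guidance in the "fix comments" commit above (the total reconnection period implied by storm.messaging.netty.max_retries should exceed storm.zookeeper.session.timeout, so the client can learn the target worker is dead and abort before throwing a RuntimeException) can be sanity-checked with simple arithmetic. The capped exponential backoff used here is an assumed shape for illustration, not necessarily Storm's exact retry formula:

```java
// Back-of-the-envelope check: the total time spent retrying should exceed
// the ZooKeeper session timeout, so the dead worker can be declared gone
// (letting the client abort cleanly) before retries are exhausted.
// Assumption: a capped exponential backoff, not Storm's exact formula.
public class RetryWindow {
    // Sum of sleeps for maxRetries attempts, doubling from baseMs up to maxMs.
    public static long totalRetryMs(long baseMs, long maxMs, int maxRetries) {
        long total = 0, sleep = baseMs;
        for (int i = 0; i < maxRetries; i++) {
            total += sleep;
            sleep = Math.min(maxMs, sleep * 2);
        }
        return total;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 20_000;  // storm.zookeeper.session.timeout (example value)
        long window = totalRetryMs(100, 1_000, 30);
        System.out.println("retry window " + window + " ms, session timeout "
                + sessionTimeoutMs + " ms, safe=" + (window > sessionTimeoutMs));
    }
}
```

With these example numbers the retry window is 27,500 ms, comfortably above a 20-second session timeout; too few retries (or too short a cap) would exhaust the window first and surface the RuntimeException the commit warns about.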
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73932513 I see, there are multiple remote merges, which makes a rebase impossible. How about creating a patch file against master and then applying the patch with a new commit log?
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73946399 I did exactly this in #429.
Github user miguno closed the pull request at: https://github.com/apache/storm/pull/428
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73928303 I tried rebasing (also to fix the incorrect commit message that starts with STORM-32*7*) but gave up after several failed attempts. Feel free to give it a try though -- maybe your git-fu is better than mine. :-)
GitHub user miguno opened a pull request: https://github.com/apache/storm/pull/429 STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages **This PR contains the same code as https://github.com/apache/storm/pull/428 but as a single commit for a cleaner commit history of our Storm repo.** -- This is an improved version of the original pull request discussed at https://github.com/apache/storm/pull/268. Please refer to the discussion in the link above. **Note**: Please give attribution to @tedxia when merging the pull request as he did a lot (most?) of the work in this pull request. The changes of this pull request include: - Most importantly, we fix a bug in Storm that may cause a cascading failure in a Storm cluster, to the point where the whole cluster becomes unusable. This is achieved by the work described in the next bullet points. - We refactor and improve the Netty messaging backend, notably the client. - During the initial startup of a topology, Storm will now wait until worker (Netty) connections are ready for operation. See the [original discussion thread](https://github.com/apache/storm/pull/268) for the detailed explanation and justification of this change. @clockfly, @tedxia: Please add any further comments to STORM-329 to this pull request, if possible. You can merge this pull request into a Git repository by running: $ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329-diff Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/429.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #429 commit 205eaf4ebe28ab5550a842ea9aabd23b41678743 Author: Michael G. 
Noll mn...@verisign.com Date: 2015-02-11T18:55:53Z STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages Thanks to @tedxia for the initial work on this patch, which covered a lot if not most of the work!
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74026208 I admit to not understanding all the intricacies of the code, as I'm still coming to terms with the different parts of Storm. However, does this PR handle the case of the worker that remained alive trying to refresh a connection to a dead worker, or to a dead worker that is in the process of starting up? I seem to see the reverse being handled, but not the direction mentioned above. Of course, that could also be because I don't understand the code fully.