[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-25 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-76072022
  
Patch ported and merged to 0.9.x branch.

Benchmarked several topologies (both core and Trident) before (0.9.3) and 
after (0.9.4-SNAPSHOT) this patch and found no performance regressions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-25 Thread 3in
Github user 3in commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-76130337
  
@ptgoetz 
If you have the resources, testing an RC and reporting back is the best way 
to accelerate a release.
-
where do i get the rc, i will have a try,
https://github.com/ptgoetz/storm/tree/storm-329-mergenot existed
thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75713928
  
Oh, Taylor. Could you also update STORM-404 and STORM-510 as appropriate?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75891084
  
@danielschonfeld Soon. ;)

We are probably a handful of weeks out, hopefully less.

If you have the resources, testing an RC and reporting back is the best way 
to accelerate a release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread danielschonfeld
Github user danielschonfeld commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75878755
  
@ptgoetz sorry for being the newbie here, but does this mean a new jar will 
be built of 0.9.3 that will include this? or will we have to wait for 0.10.0? 

we could really use this fix 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75867311
  
I finally got this successfully back ported to the 0.9.x branch, with all 
tests passing. I will merge that soon after more testing and update all 
associated JIRAs.

Thanks for all the effort that went into this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75879324
  
@miguno yep, that's what I meant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread danielschonfeld
Github user danielschonfeld commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75881077
  
@ptgoetz what is the time frame for 0.9.4? (roughly)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75879629
  
@danielschonfeld No worries. There will be a 0.9.4 release that includes 
this fix. No need to wait for 0.10.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-23 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75634123
  
Phew. :-)

Thanks for merging, Taylor!



 On 23.02.2015, at 22:06, P. Taylor Goetz notificati...@github.com wrote:
 
 Disregard last message. It was a merge mistake (picked right when I 
should have picked left).
 
 All tests are passing now.
 
 —
 Reply to this email directly or view it on GitHub.
 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/storm/pull/429


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-23 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75631663
  
Disregard last message. It was a merge mistake (picked right when I should 
have picked left).

All tests are passing now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread nathanmarz
Github user nathanmarz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74914806
  
Nimbus only knows a worker is having trouble when it stops sending 
heartbeats. If a worker gets into a bad state, the worst thing to do is have it 
continue trying to limp along in that bad state. It should instead suicide as 
quickly as possible. It seems counterintuitive, but this aggressive suiciding 
behavior actually makes things more robust as it prevents processes from 
getting into weird, potentially undefined states. This has been a crucial 
design principle in Storm from the beginning. One consequence of it is that any 
crucial system thread that receives an unrecoverable exception must suicide the 
process rather than die quietly. 

For the connection retry problem, it's a tricky situation since it may not 
be able to connect because the other worker is still getting set up. So the 
retry policy should be somehow related to the launch timeouts for worker 
processes specified in the configuration. Not being able to connect after the 
launch timeout + a certain number of attempts + a buffer period would certainly 
qualify as a weird state, so the process should suicide in that case. 
*Suiciding and restarting gets the worker back to a known state*. 

So in this case, I am heavily in favor of Option 2. I don't care about 
killing the other tasks in the worker because this is a rare situation. It is 
infinitely more important to get the worker back to a known, robust state than 
risk leaving it in a weird state permanently.

I would like to see these issues addressed as part of this patch.

@miguno Thanks for the explanation on this patch's relation to backpressure 
– we'll handle that in a future patch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74927415
  
@nathanmarz Thanks for the detailed feedback on the max-retries issue.  As 
Bobby suggested, would you mind if we decouple the work on max-retries (tracked 
at STORM-677) from this pull request / STORM-329?  The max-retries issue has 
been a problem in Storm for a while, since before this pull request.  
Decoupling would also help us with backtracking any future issues to either 
ticket.  Right now, I feel there's almost too much meat in this pull request 
already (as we kinda covered STORM-329, STORM-404, and STORM-510 in one big 
swing).

Would that be ok for you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74894542
  
FYI: I created [STORM-677: Maximum retries strategy may cause data 
loss](https://issues.apache.org/jira/browse/STORM-677) to address the issue 
that Bobby brought up in this discussion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74895041
  
PS: We may also want to update the original 
[STORM-329](https://issues.apache.org/jira/browse/STORM-329) ticket description 
to reflect the changes in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread ptgoetz
Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74885134
  
+1 I was able to verify the fix, and am in favor of merging. I'd also like 
to apply it to the 0.9.x branch as I feel it's an important fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74898690
  
Thanks, Taylor!  Let me know if I can help with sorting out the test 
failures.

Also regarding JIRA: I forgot to mention that it looks like we need to 
update STORM-404 and STORM-510 as well as STORM-329 has been said to cover 
those issues, too.  It looks to me like this is actually the case, but I'd 
prefer another pair of eyes to be sure we're not mistakenly closing two related 
tickets.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74702215
  
And FWIW, with the code in this PR the total test suite takes about 5mins 
to complete.

```
$ mvn clean install
...

[INFO] 

[INFO] Reactor Summary:
[INFO]
[INFO] Storm .. SUCCESS [  
1.696 s]
[INFO] maven-shade-clojure-transformer  SUCCESS [  
1.930 s]
[INFO] storm-maven-plugins  SUCCESS [  
2.033 s]
[INFO] Storm Core . SUCCESS [03:56 
min]
[INFO] storm-starter .. SUCCESS [  
6.616 s]
[INFO] storm-kafka  SUCCESS [ 
48.469 s]
[INFO] storm-hdfs . SUCCESS [  
1.875 s]
[INFO] storm-hbase  SUCCESS [  
2.271 s]
[INFO] 

[INFO] BUILD SUCCESS
[INFO] 

[INFO] Total time: 05:02 min
[INFO] Finished at: 2015-02-17T17:49:19+01:00
[INFO] Final Memory: 67M/282M
[INFO] 

```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread revans2
Github user revans2 commented on a diff in the pull request:

https://github.com/apache/storm/pull/429#discussion_r24824540
  
--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;
 
 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;
 
-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty 
server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed 
asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at 
the expense of network latency.  The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ * - Note: The current implementation drops any messages that are 
being enqueued for sending if the connection to
+ *   the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background.  It will, at 
fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote 
destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements 
IStatefulObject {
+
 private static final Logger LOG = 
LoggerFactory.getLogger(Client.class);
 private static final String PREFIX = Netty-Client-;
-private final int max_retries;
-private final int base_sleep_ms;
-private final int max_sleep_ms;
+private static final long NO_DELAY_MS = 0L;
+private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
 private final StormBoundedExponentialBackoffRetry retryPolicy;
-private AtomicReferenceChannel channelRef;
 private final ClientBootstrap bootstrap;
-private InetSocketAddress remote_addr;
-
-private AtomicInteger totalReconnects;
-private AtomicInteger messagesSent;
-private AtomicInteger messagesLostReconnect;
-private final Random random = new Random();
-private final ChannelFactory factory;
-private final int buffer_size;
-private boolean closing;
-
-private int messageBatchSize;
-
-private AtomicLong pendings;
-
-Map storm_conf;
+private final InetSocketAddress dstAddress;
+protected final String dstAddressPrefixedName;
+
+/**
+ * The channel used for all write operations from this client to the 
remote destination.
+ */
+private final AtomicReferenceChannel channelRef = new 
AtomicReferenceChannel(null);
+
+
+/**
+ * Maximum number of reconnection attempts we will perform after a 
disconnect before giving up.
+ */
+private final int maxReconnectionAttempts;
--- End diff --

I personally prefer option 3, no maximum number of reconnection attempts.  
Having the client decide that it is done, before nimbus does feels like it is 
asking for trouble.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74701138
  
 I am seeing a lot of tests timing out with this change. Has anyone else 
seen this?

Hmm.  All the tests are passing for me (and they have been since a while).

Do you have any pointers?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread revans2
Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74702976
  
I don't it could just be my mac acting funny.  I'll dig into it, but don't 
block the pull request on me.  I saw similar things on a different pull request 
that corrected themselves spontaneously.  My guess right now is that it could 
have something to do with being on battery power vs being plugged into the wall.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread revans2
Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74696132
  
I am +1 for the code as is.  Perhaps a separate JIRA for the reconnection 
attempts would be best.  The issue has been in for quite a while now, and this 
code was not intended to fix it to begin with.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread revans2
Github user revans2 commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74707726
  
OK It is something odd with my mac. It looks like the difference is wired 
vs wireless networking :), or possibly even switching between the two.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-16 Thread miguno
Github user miguno commented on a diff in the pull request:

https://github.com/apache/storm/pull/429#discussion_r24738502
  
--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;
 
 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;
 
-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty 
server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed 
asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at 
the expense of network latency.  The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ * - Note: The current implementation drops any messages that are 
being enqueued for sending if the connection to
+ *   the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background.  It will, at 
fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote 
destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements 
IStatefulObject {
+
 private static final Logger LOG = 
LoggerFactory.getLogger(Client.class);
 private static final String PREFIX = Netty-Client-;
-private final int max_retries;
-private final int base_sleep_ms;
-private final int max_sleep_ms;
+private static final long NO_DELAY_MS = 0L;
+private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
 private final StormBoundedExponentialBackoffRetry retryPolicy;
-private AtomicReferenceChannel channelRef;
 private final ClientBootstrap bootstrap;
-private InetSocketAddress remote_addr;
-
-private AtomicInteger totalReconnects;
-private AtomicInteger messagesSent;
-private AtomicInteger messagesLostReconnect;
-private final Random random = new Random();
-private final ChannelFactory factory;
-private final int buffer_size;
-private boolean closing;
-
-private int messageBatchSize;
-
-private AtomicLong pendings;
-
-Map storm_conf;
+private final InetSocketAddress dstAddress;
+protected final String dstAddressPrefixedName;
+
+/**
+ * The channel used for all write operations from this client to the 
remote destination.
+ */
+private final AtomicReferenceChannel channelRef = new 
AtomicReferenceChannel(null);
+
+
+/**
+ * Maximum number of reconnection attempts we will perform after a 
disconnect before giving up.
+ */
+private final int maxReconnectionAttempts;
+
+/**
+ * Total number of connection attempts.
+ */
+private final AtomicInteger totalConnectionAttempts = new 
AtomicInteger(0);
+
+/**
+ * Number of connection attempts since the last disconnect.
+ */
+private final AtomicInteger connectionAttempts = new AtomicInteger(0);
+
+/**
+ * Number of messages successfully sent to the remote destination.
+ */
+private final AtomicInteger messagesSent = new AtomicInteger(0);
+
+/**
+ * Number of messages that could not be sent to the remote destination.
+ */
+private final AtomicInteger messagesLost = new AtomicInteger(0);
+
+/**
+ * Number of messages buffered in memory.
+ */
+private final AtomicLong pendingMessages = new AtomicLong(0);
+
+/**
+ * This flag is set to true if and only if a client instance is being 
closed.
+ */
+private volatile boolean closing = false;
+
+/**
+ * When set to true, then the background flusher thread will flush any 
pending messages on its next run.
+ */
+private final AtomicBoolean backgroundFlushingEnabled = new 
AtomicBoolean(false);
+
+/**
+ * The absolute time (in ms) when the next background flush should be 
performed.
+ *
+ * Note: The 

[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-13 Thread revans2
Github user revans2 commented on a diff in the pull request:

https://github.com/apache/storm/pull/429#discussion_r24680701
  
--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;
 
 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;
 
-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty 
server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed 
asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at 
the expense of network latency.  The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ * - Note: The current implementation drops any messages that are 
being enqueued for sending if the connection to
+ *   the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background.  It will, at 
fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote 
destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements 
IStatefulObject {
+
 private static final Logger LOG = 
LoggerFactory.getLogger(Client.class);
 private static final String PREFIX = Netty-Client-;
-private final int max_retries;
-private final int base_sleep_ms;
-private final int max_sleep_ms;
+private static final long NO_DELAY_MS = 0L;
+private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
 private final StormBoundedExponentialBackoffRetry retryPolicy;
-private AtomicReferenceChannel channelRef;
 private final ClientBootstrap bootstrap;
-private InetSocketAddress remote_addr;
-
-private AtomicInteger totalReconnects;
-private AtomicInteger messagesSent;
-private AtomicInteger messagesLostReconnect;
-private final Random random = new Random();
-private final ChannelFactory factory;
-private final int buffer_size;
-private boolean closing;
-
-private int messageBatchSize;
-
-private AtomicLong pendings;
-
-Map storm_conf;
+private final InetSocketAddress dstAddress;
+protected final String dstAddressPrefixedName;
+
+/**
+ * The channel used for all write operations from this client to the 
remote destination.
+ */
+private final AtomicReferenceChannel channelRef = new 
AtomicReferenceChannel(null);
+
+
+/**
+ * Maximum number of reconnection attempts we will perform after a 
disconnect before giving up.
+ */
+private final int maxReconnectionAttempts;
+
+/**
+ * Total number of connection attempts.
+ */
+private final AtomicInteger totalConnectionAttempts = new 
AtomicInteger(0);
+
+/**
+ * Number of connection attempts since the last disconnect.
+ */
+private final AtomicInteger connectionAttempts = new AtomicInteger(0);
+
+/**
+ * Number of messages successfully sent to the remote destination.
+ */
+private final AtomicInteger messagesSent = new AtomicInteger(0);
+
+/**
+ * Number of messages that could not be sent to the remote destination.
+ */
+private final AtomicInteger messagesLost = new AtomicInteger(0);
+
+/**
+ * Number of messages buffered in memory.
+ */
+private final AtomicLong pendingMessages = new AtomicLong(0);
+
+/**
+ * This flag is set to true if and only if a client instance is being 
closed.
+ */
+private volatile boolean closing = false;
+
+/**
+ * When set to true, then the background flusher thread will flush any 
pending messages on its next run.
+ */
+private final AtomicBoolean backgroundFlushingEnabled = new 
AtomicBoolean(false);
+
+/**
+ * The absolute time (in ms) when the next background flush should be 
performed.
+ *
+ * Note: The 

[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread danielschonfeld
Github user danielschonfeld commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74130344
  
Doesn't dropping the messages coming from a non ack/fail caring spout 
negate the 'at least once' attempt of storm? I mean doesn't that kinda force 
you to make all your spouts ack/fail aware where before you could have gotten 
away without it?

In other words.  There is a chance that if the worker that died is the one 
containing the spout and if the first bolt is located on another worker, that 
technically at-least once wasn't tried but rather fell to the floor right away.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74147926
  
If you need at-least-once processing you must use an acking topology, which 
will allow Storm to replay lost messages.  If instead you go with an unacking 
topology (= no guaranteed message processing) then you may run into data loss.  
There re pros and cons for each variant, and e.g. in our case we use both 
depending on the use case.

Also: The semantics described above have been in Storm right from the 
beginning.  None of these have been changed by this pull request.


 On 12.02.2015, at 20:01, Daniel Schonfeld notificati...@github.com 
wrote:
 
 Doesn't dropping the messages coming from a non ack/fail caring spout 
negate the 'at least once' attempt of storm? I mean doesn't that kinda force 
you to make all your spouts ack/fail aware where before you could have gotten 
away without it?
 
 In other words. There is a chance that if the worker that died is the one 
containing the spout and if the first bolt is located on another worker, that 
technically at-least once wasn't tried but rather fell to the floor right away.
 
 —
 Reply to this email directly or view it on GitHub.
 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread nathanmarz
Github user nathanmarz commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74036235
  
I retract my earlier -1. It was mentioned that this enables backpressure 
for unacked topologies. Is this the case? If so, this is a great new feature of 
Storm and there should be tests added testing this behavior. Namely, it should 
test that:

- Spouts stop emitting when all buffers fill up
- Topology recovers on worker death and only a subset of messages are 
dropped in that case


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread clockfly
Github user clockfly commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74084548
  
+1

On Thu, Feb 12, 2015 at 6:44 PM, Michael G. Noll notificati...@github.com
wrote:

 Thanks for your feedback, Nathan.

 As far as I understand this patch does not enable backpressure. But:
 because there is no backpressure (yet) that we can rely on, this patch 
will
 improve at least the situation during the startup phase of a topology to
 prevent that a) an unacked topo will not lose messages during the startup,
 and b) we do not need to unnecessarily replay messages in case of acked
 topos during their startup. This is achieved by checking that all worker
 connections are ready before the topology starts processing data.

 So backpressure is still an open feature. Backpressure was IIRC mentioned
 in the initial PR because there was a deficiency (dating back to a ZMQ
 related TODO) that caused problems related to this PR/Storm tickets (327,
 404, and one more). However, this patch does make the best of the current
 situation even in the absence of backpressure. But first and foremost this
 patch fixes a (critical) cascading failure that can bring Storm clusters 
to
 a halt.

 Please correct me if I'm mistaken in my summary.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/storm/pull/429#issuecomment-74050954.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74032874
  
This patch allows a worker to properly detect that the connection to a peer 
becomes unavailable -- for whatever reason (the remote worker is dead or 
restarting, there was a network glitch, etc). Also, any reconnection attempts 
are now async so that reconnecting will not block other activities of the 
worker (such as sending messages to other workers it is still connected to).

So to your question: this patch includes the case of the worker that 
remained alive trying to (re)connect to a dead peer.

Does that help?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread clockfly
Github user clockfly commented on the pull request:

https://github.com/apache/storm/pull/428#issuecomment-73925561
  
Something like git rebase -i upstream/master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
GitHub user miguno opened a pull request:

https://github.com/apache/storm/pull/428

STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

This is an improved version of the original pull request discussed at 
https://github.com/apache/storm/pull/268.  Please refer to the discussion in 
the link above.

**Note**:  Please give attribution to @tedxia when merging the pull request 
as he did a lot (most?) of the work in this pull request.

The changes of this pull request include:

- Most importantly, we fix a bug in Storm that may cause a cascading 
failure in a Storm cluster, to the point where the whole cluster becomes 
unusable.  This is achieved by the work described in the next bullet points.
- We refactor and improve the Netty messaging backend, notably the client.
- During the initial startup of a topology, Storm will now wait until 
worker (Netty) connections are ready for operation.  See the [original 
discussion thread](https://github.com/apache/storm/pull/268) for the detailed 
explanation and justification of this change.

@clockfly, @tedxia: Please add any further comments to STORM-329 to this 
pull request, if possible.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #428


commit 8c52a5b518021a6beff372acbeb66a963a1d4f74
Author: xiajun xia...@xiaomi.com
Date:   2014-09-24T12:39:18Z

STORM-329 : buffer message in client and reconnect remote server async

commit 7cc8c8b1b59d415f1fa54e081127336fcdaeb706
Author: xiajun xia...@xiaomi.com
Date:   2014-09-26T02:43:55Z

STORM-329: fix continue flush after client had been closed

commit 9826bc9ef7aee8c83e90d56f52ac70ac7165d769
Author: xiajun xia...@xiaomi.com
Date:   2014-09-30T03:25:08Z

Client not clean timeout TaskMessage

commit 978c969fdb4b5904b6a87c100fbd80fe26bf39cf
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-20T01:08:43Z

Merge remote-tracking branch 'upstream/master'

commit 44f8260bbf489c6a2741fb4d8f9196ea6ddb51cc
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-20T01:21:25Z

STORM-404, STORM-510, STORM-329: 1. break the reconnection to target if 
worker is informed that the target is the down, we that we avoid 
RuntimeException when reconnection failed. 2. When worker get started, need to 
make sure all target workers are alive before launching spouts 3. When target 
worker is down, all messages sending to the target worker will be dropped.

commit dea5fbe35c4d9b18a89dae320f9fc985f25bd31a
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-20T01:24:45Z

STORM-329: fix comment grammar

commit 16de9f3321827624865a4450e93bd53efb75ed93
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-28T03:31:23Z

test

commit e8dcf9155c85d3541c9352faf9a0651614b93eb6
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-29T01:56:19Z

STORM-329: avoid deadlock

commit 22e7014dedfd580de9dd1d6b2083c8fb3d77d406
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-29T01:59:09Z

Revert test

This reverts commit 16de9f3321827624865a4450e93bd53efb75ed93.

commit ddef6667cdc6de9aebf0d4006ad9e7df2bfbb3bb
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-29T08:14:34Z

Merge remote-tracking branch 'upstream/master'

commit baf3c628db3899def89ee92c752524d041bc8b40
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:17:07Z

STORM-329: fix UT. Add a new flag in worker data worker-active-flag, Wait 
connections to be ready asyncly.

commit e1c463f5681dbf6a66c868d467aca14064da1e9b
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:22:22Z

STORM-329: fix comments. Add a description about 
storm.messaging.netty.max_retries, that the reconnection period should also 
be bigger than storm.zookeeper.session.timeout, so that the reconnection can 
be aborted(when target worker is dead) before the reconnection failed and throw 
RunTimeException

commit 2d3fad121481da40258af27e6d7fbcb148365e76
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:22:22Z

STORM-329: fix a integration issue

commit 60f04f9e397cf49e4fba6fe6a2f0bfb23d5a8605
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:28:17Z

Merge branch 'master' of https://github.com/tedxia/incubator-storm

commit 41aafbecac2cf3295255c7dc9b299b8c0c555390
Author: xiajun xia...@xiaomi.com
Date:   2014-11-18T06:53:14Z

Merge remote-tracking branch 'remotes/apache-storm/0.9.3-branch' into 
ted-master

Conflicts:
storm-core/src/jvm/backtype/storm/messaging/netty/Client.java

commit 2c39866cf8bbca136d3b88f796b9f847b282fdd7
Author: xiajun 

[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread clockfly
Github user clockfly commented on the pull request:

https://github.com/apache/storm/pull/428#issuecomment-73932513
  
I see, there are multiple remote-merging, which then make rebase 
impossible. 

How about create a patch file against master and then apply the patch with 
new commit logs?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/428#issuecomment-73946399
  
I did exactly this in #429.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
Github user miguno closed the pull request at:

https://github.com/apache/storm/pull/428


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/428#issuecomment-73928303
  
I tried rebasing (also to fix the incorrect commit message that starts with 
STORM-32*7*) but gave up after several failed attempts.  Feel free to give it a 
try though -- maybe your git-fu is better than mine. :-)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
GitHub user miguno opened a pull request:

https://github.com/apache/storm/pull/429

STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

**This PR contains the same code as 
https://github.com/apache/storm/pull/428 but as a single commit for a cleaner 
commit history of our Storm repo.**

--

This is an improved version of the original pull request discussed at 
https://github.com/apache/storm/pull/268.  Please refer to the discussion in 
the link above.

**Note**:  Please give attribution to @tedxia when merging the pull request 
as he did a lot (most?) of the work in this pull request.

The changes of this pull request include:

- Most importantly, we fix a bug in Storm that may cause a cascading 
failure in a Storm cluster, to the point where the whole cluster becomes 
unusable.  This is achieved by the work described in the next bullet points.
- We refactor and improve the Netty messaging backend, notably the client.
- During the initial startup of a topology, Storm will now wait until 
worker (Netty) connections are ready for operation.  See the [original 
discussion thread](https://github.com/apache/storm/pull/268) for the detailed 
explanation and justification of this change.

@clockfly, @tedxia: Please add any further comments to STORM-329 to this 
pull request, if possible.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329-diff

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/429.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #429


commit 205eaf4ebe28ab5550a842ea9aabd23b41678743
Author: Michael G. Noll mn...@verisign.com
Date:   2015-02-11T18:55:53Z

STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

Thanks to @tedxia for the initial work on this patch, which covered a
lot if not most of the work!




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread danielschonfeld
Github user danielschonfeld commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74026208
  
I admit to not understanding all the intricacies of the code as i'm still 
coming to terms with the different parts of storm.  However, does this PR 
handle the case of the worker that remained alive trying to refresh a 
connection to a dead worker or a dead worker that is in the process of starting 
up?

I seem to see the reverse being handled, but not in the direction mentioned 
above.  Of course that could also be cause i don't understand the code fully.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---