[jira] [Created] (STORM-2199) Module enabling storm to write memory mapped files

2016-11-10 Thread Mariamma Antony (JIRA)
Mariamma Antony created STORM-2199:
--

 Summary: Module enabling storm to write memory mapped files
 Key: STORM-2199
 URL: https://issues.apache.org/jira/browse/STORM-2199
 Project: Apache Storm
  Issue Type: Improvement
  Components: storm-core
Reporter: Mariamma Antony
 Fix For: 2.0.0


Add a module to write from Storm to memory-mapped files.
# Support multiple file formats
# Support multiple file rotation and sync policy implementations
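
For illustration only, a minimal Java sketch of the underlying technique: writing records into a memory-mapped region with java.nio. This is not the proposed module. The class name, the method name, and the fixed-size single region are assumptions made for the example, and the comments only mark where a rotation or sync policy could plug in.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not Storm code: writes newline-terminated records
// into one fixed-size memory-mapped region of a file.
final class MappedFileWriterSketch {

    // Returns the number of bytes written into the mapped region.
    static int writeRecords(Path file, int regionSize, String... records) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE grows the file to regionSize bytes if it is shorter.
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, regionSize);
            for (String record : records) {
                byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
                if (bytes.length > region.remaining()) {
                    break; // a rotation policy would switch to a new file here
                }
                region.put(bytes);
            }
            region.force(); // stand-in for a sync policy: flush mapped changes to disk
            return region.position();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("storm-mmap-sketch", ".dat");
        file.toFile().deleteOnExit();
        System.out.println("bytes written: " + writeRecords(file, 1024, "tuple-1", "tuple-2"));
    }
}
```

A real implementation would presumably remap or rotate when the region fills up instead of dropping the remaining records.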







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (STORM-2198) perform RotationAction when stopping HdfsBolt

2016-11-10 Thread Xin Wang (JIRA)
Xin Wang created STORM-2198:
---

 Summary: perform RotationAction when stopping HdfsBolt
 Key: STORM-2198
 URL: https://issues.apache.org/jira/browse/STORM-2198
 Project: Apache Storm
  Issue Type: Bug
  Components: storm-hdfs
Reporter: Xin Wang
Assignee: Xin Wang


I have an _HdfsBolt_ with a _TimedRotationPolicy_ and a _MoveFileAction_. I find that the 
bolt doesn't move the file when I stop the HdfsBolt and the _RotationPolicy_ has not yet 
been triggered.
Looking at the code, the _rotateOutputFile_ method is only called when the 
_RotationPolicy_ is triggered or _writer.needsRotation_ is true. I will add some logic 
to the _cleanup_ method.
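
To illustrate the idea, here is a small self-contained Java sketch of running the rotation actions once more on shutdown. It does not use the storm-hdfs classes: FileBolt, Action, and the file naming are hypothetical stand-ins, and the actual fix may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model; these are not the storm-hdfs classes.
final class CleanupRotationSketch {

    // Stand-in for a rotation action such as moving the finished file.
    interface Action {
        void execute(String finishedFile);
    }

    static final class FileBolt {
        private final List<Action> actions = new ArrayList<>();
        private int rotation = 0;
        private boolean open = true;

        void addAction(Action action) {
            actions.add(action);
        }

        private String currentFile() {
            return "part-" + rotation + ".txt";
        }

        // Called when the rotation policy fires: finish the current file,
        // run the actions on it, and continue with the next file.
        void rotateOutputFile() {
            String finished = currentFile();
            rotation++;
            for (Action action : actions) {
                action.execute(finished);
            }
        }

        // Called on shutdown: without this step the last open file would
        // never be handed to the actions.
        void cleanup() {
            if (!open) {
                return;
            }
            open = false;
            String finished = currentFile();
            for (Action action : actions) {
                action.execute(finished);
            }
        }
    }

    public static void main(String[] args) {
        List<String> moved = new ArrayList<>();
        FileBolt bolt = new FileBolt();
        bolt.addAction(moved::add);
        bolt.rotateOutputFile();
        bolt.cleanup();
        System.out.println("files handed to actions: " + moved);
    }
}
```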





[jira] [Commented] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down and up

2016-11-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655721#comment-15655721
 ] 

Saikat Kanjilal commented on STORM-2044:


Hi [~chenyuzhao], I'm new to Storm and interested in contributing. Can I 
help with this issue?  Let me know more details and logical first steps.

> Nimbus should not make assignments crazily when Pacemaker goes down and up
> --
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service with no HA support. When 
> it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes 
> back up immediately, recovery takes a long time when there are dozens of GB 
> of heartbeats. Until the worker heartbeats are completely restored, 
> Nimbus considers these workers dead because of heartbeat timeouts and 
> keeps reassigning these "dead" workers until heartbeats return to 
> normal. So, during the recovery time, many topologies are reassigned 
> continuously and throughput drops sharply.  
> This is not acceptable. 
> So I think Pacemaker is not suitable for production as long as the problem 
> above exists.
> I can think of several ways to solve this problem:
>   1. Pacemaker HA
>   2. When Pacemaker goes down, notify Nimbus not to make any more 
> reassignments until it recovers

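One possible reading of option 2 in the description above, sketched in Java: stale heartbeats do not trigger reassignment while the heartbeat store is down or within a grace period after it recovers. This is not Nimbus code. The class, its methods, and the grace-period idea are assumptions made for illustration.

```java
// Hypothetical sketch, not Nimbus code: decides whether a worker whose
// heartbeat looks stale should be reassigned.
final class HeartbeatGraceSketch {
    private final long timeoutMs;
    private final long graceMs;
    private boolean storeDown = false;
    private boolean everRecovered = false;
    private long recoveredAtMs = 0L;

    HeartbeatGraceSketch(long timeoutMs, long graceMs) {
        this.timeoutMs = timeoutMs;
        this.graceMs = graceMs;
    }

    // The heartbeat store (Pacemaker in the issue) was detected as down.
    void onStoreDown() {
        storeDown = true;
    }

    // The heartbeat store came back; heartbeats may still be incomplete.
    void onStoreRecovered(long nowMs) {
        storeDown = false;
        everRecovered = true;
        recoveredAtMs = nowMs;
    }

    boolean shouldReassign(long lastHeartbeatMs, long nowMs) {
        if (storeDown) {
            return false; // missing heartbeats say nothing about the worker
        }
        if (everRecovered && nowMs - recoveredAtMs < graceMs) {
            return false; // give workers time to repopulate their heartbeats
        }
        return nowMs - lastHeartbeatMs > timeoutMs;
    }

    public static void main(String[] args) {
        HeartbeatGraceSketch check = new HeartbeatGraceSketch(30_000L, 60_000L);
        check.onStoreDown();
        System.out.println("reassign during outage: " + check.shouldReassign(0L, 40_000L));
        check.onStoreRecovered(100_000L);
        System.out.println("reassign during grace period: " + check.shouldReassign(0L, 120_000L));
        System.out.println("reassign after grace period: " + check.shouldReassign(0L, 170_000L));
    }
}
```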




[jira] [Commented] (STORM-2164) Create simple generic plugin system to register codahale reporters

2016-11-10 Thread P. Taylor Goetz (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654647#comment-15654647
 ] 

P. Taylor Goetz commented on STORM-2164:


[~abellina] the feature branch will be in the Apache repo. We can relax the 
commit rules for that branch while we work on features, then when ready, we 
create a pull request from that branch to an official version branch for formal 
review.

I'll create a branch off 1.x-branch called "metrics_v2" so we can start 
creating pull requests, etc. 

bq. I should have some code to show for this JIRA soon. I've been generating 
metrics from the workers and reporting them using the file system, and having 
Slots pick them up and send over to Nimbus via thrift. This is not related to 
this JIRA specifically, but I wanted to get that going before doing configs.

Yes, there's going to be overlap with what we're doing. For now, I have a 
hard-coded reporter in my code that will obviously go away once your config 
work is ready.

> Create simple generic plugin system to register codahale reporters
> --
>
> Key: STORM-2164
> URL: https://issues.apache.org/jira/browse/STORM-2164
> Project: Apache Storm
>  Issue Type: Improvement
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>
> Configurable plugin interface s.t. daemons can instantiate codahale reporters.

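As a rough illustration of what a configurable plugin interface could look like, the Java sketch below instantiates reporter classes by name through reflection and starts them. Nothing here comes from Storm or from the Codahale metrics library: ReporterPlugin, InMemoryReporter, and startConfigured are hypothetical names, and a real design would presumably pass configuration and a metric registry to each reporter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: neither this interface nor the class names come from
// Storm or from the Codahale metrics library.
final class ReporterPluginSketch {

    interface ReporterPlugin {
        void start();
        boolean isStarted();
    }

    // Trivial plugin used only to exercise the loader.
    static final class InMemoryReporter implements ReporterPlugin {
        private boolean started = false;

        @Override
        public void start() {
            started = true;
        }

        @Override
        public boolean isStarted() {
            return started;
        }
    }

    // Instantiates each configured class through its no-arg constructor and starts it.
    static List<ReporterPlugin> startConfigured(List<String> classNames) {
        List<ReporterPlugin> started = new ArrayList<>();
        for (String name : classNames) {
            try {
                ReporterPlugin plugin = Class.forName(name)
                        .asSubclass(ReporterPlugin.class)
                        .getDeclaredConstructor()
                        .newInstance();
                plugin.start();
                started.add(plugin);
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("Cannot load reporter plugin: " + name, e);
            }
        }
        return started;
    }

    public static void main(String[] args) {
        List<String> configured = Collections.singletonList(InMemoryReporter.class.getName());
        System.out.println("reporters started: " + startConfigured(configured).size());
    }
}
```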




[jira] [Comment Edited] (STORM-2194) ReportErrorAndDie doesn't always die

2016-11-10 Thread Craig Hawco (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654519#comment-15654519
 ] 

Craig Hawco edited comment on STORM-2194 at 11/10/16 4:49 PM:
--

Will do, however after looking through STORM-772 and STORM-773, it's not 
entirely clear what the intent was. 

Basically, my understanding is that the exception passed here is the exception 
from the worker -- and {{exception-cause?}} walks the entire exception chain 
checking for {{InterruptedException}} or {{InterruptedIOException}} -- which 
may have been due to a bolt talking to an external service (e.g. 
{{SocketTimeoutException}} is an {{InterruptedIOException}}). 

So, some options:

# My understanding is incorrect, and this error is NOT from the bolt itself
# {{exception-cause?}} doesn't walk the entire exception tree as I thought
# This behaviour is mostly intentional, but should only be applied if error 
itself is {{InterruptedException}} or {{InterruptedIOException}}, but not if 
those appear anywhere in the chain

I don't think #2 above is actually the case:

{code}
user=> (def ex (new RuntimeException (new InterruptedException)))
#'user/ex
(defn exception-cause?
  [klass ^Throwable t]
  (->> (iterate #(.getCause ^Throwable %) t)
       (take-while identity)
       (some (partial instance? klass))
       boolean))
#'user/exception-cause?
user=> (exception-cause? InterruptedException ex)
true
user=>
{code}

So, either the exception isn't the one raised by the bolt (I'll look for some 
evidence to support this next), the check should only be ignoring errors that 
are themselves {{InterruptedException}}/{{InterruptedIOException}} but not 
looking elsewhere in the chain, or this was intended as just a logging line, 
and {{suicide-fn}} should still get invoked after logging.
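
For readers following the Java side, a minimal sketch of the same chain walk in Java. The class and method names are made up for this example and this is not the Storm Utils implementation. It also shows the point about external services: java.net.SocketTimeoutException is a subclass of java.io.InterruptedIOException, so a wrapped socket timeout matches the check.

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Sketch of the same check as the Clojure exception-cause? helper;
// the names here are hypothetical.
final class ExceptionChainSketch {

    // True if t or any exception in its cause chain is an instance of klass.
    static boolean causeIsInstanceOf(Class<? extends Throwable> klass, Throwable t) {
        for (Throwable current = t; current != null; current = current.getCause()) {
            if (klass.isInstance(current)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(new InterruptedException());
        Throwable timeout = new RuntimeException(new SocketTimeoutException("read timed out"));
        System.out.println(causeIsInstanceOf(InterruptedException.class, wrapped));
        System.out.println(causeIsInstanceOf(InterruptedIOException.class, timeout));
    }
}
```

Both calls in main return true, so a bolt that merely times out talking to an external service would be treated the same way as a genuinely interrupted thread.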


was (Author: chawco):
Will do, however after looking through STORM-772 and STORM-773, it's not 
entirely clear what the intent was. 

Basically, my understanding is that the exception passed here is the exception 
from the worker -- and {{exception-cause?}} walks the entire exception chain 
checking for {{InterruptedException}} or {{InterruptedIOException}} -- which 
may have been due to a bolt talking to an external service (e.g. 
{{SocketTimeoutException}} is an {{InterruptedIOException}}). 

So, some options:

# My understanding is incorrect, and this error is NOT from the bolt itself
# {{exception-cause?}} doesn't walk the entire exception tree as I thought
# This behaviour is mostly intentional, but should only be applied if error 
itself is {{InterruptedException}} or {{InterruptedIOException}}, but not if 
those appear anywhere in the chain

I think #3 above isn't actually the case:

{code}
user=> (def ex (new RuntimeException (new InterruptedException)))
#'user/ex
(defn exception-cause?
  [klass ^Throwable t]
  (->> (iterate #(.getCause ^Throwable %) t)
       (take-while identity)
       (some (partial instance? klass))
       boolean))
#'user/exception-cause?
user=> (exception-cause? InterruptedException ex)
true
user=>
{code}

So, either the exception isn't the one raised by the bolt (I'll look for some 
evidence to support this next), the check should only be ignoring errors that 
are themselves {{InterruptedException}}/{{InterruptedIOException}} but not 
looking elsewhere in the chain, or this was intended as just a logging line, 
and {{suicide-fn}} should still get invoked after logging.

> ReportErrorAndDie doesn't always die
> 
>
> Key: STORM-2194
> URL: https://issues.apache.org/jira/browse/STORM-2194
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-core
>Affects Versions: 2.0.0, 1.0.2
>Reporter: Craig Hawco
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've been trying to track down a cause of some of our issues with some 
> exceptions leaving Storm workers in a zombified state for some time. I 
> believe I've isolated the bug to the behaviour in 
> :report-error-and-die/reportErrorAndDie in the executor. Essentially:
> {code}
>  :report-error-and-die (fn [error]
>  (try
>((:report-error <>) error)
>(catch Exception e
>  (log-message "Error while reporting error to 
> cluster, proceeding with shutdown")))
>  (if (or
> 

[jira] [Commented] (STORM-2194) ReportErrorAndDie doesn't always die

2016-11-10 Thread Craig Hawco (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654519#comment-15654519
 ] 

Craig Hawco commented on STORM-2194:


Will do, however after looking through STORM-772 and STORM-773, it's not 
entirely clear what the intent was. 

Basically, my understanding is that the exception passed here is the exception 
from the worker -- and {{exception-cause?}} walks the entire exception chain 
checking for {{InterruptedException}} or {{InterruptedIOException}} -- which 
may have been due to a bolt talking to an external service (e.g. 
{{SocketTimeoutException}} is an {{InterruptedIOException}}). 

So, some options:

# My understanding is incorrect, and this error is NOT from the bolt itself
# {{exception-cause?}} doesn't walk the entire exception tree as I thought
# This behaviour is mostly intentional, but should only be applied if error 
itself is {{InterruptedException}} or {{InterruptedIOException}}, but not if 
those appear anywhere in the chain

I think #3 above isn't actually the case:

{code}
user=> (def ex (new RuntimeException (new InterruptedException)))
#'user/ex
(defn exception-cause?
  [klass ^Throwable t]
  (->> (iterate #(.getCause ^Throwable %) t)
       (take-while identity)
       (some (partial instance? klass))
       boolean))
#'user/exception-cause?
user=> (exception-cause? InterruptedException ex)
true
user=>
{code}

So, either the exception isn't the one raised by the bolt (I'll look for some 
evidence to support this next), the check should only be ignoring errors that 
are themselves {{InterruptedException}}/{{InterruptedIOException}} but not 
looking elsewhere in the chain, or this was intended as just a logging line, 
and {{suicide-fn}} should still get invoked after logging.

> ReportErrorAndDie doesn't always die
> 
>
> Key: STORM-2194
> URL: https://issues.apache.org/jira/browse/STORM-2194
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-core
>Affects Versions: 2.0.0, 1.0.2
>Reporter: Craig Hawco
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've been trying to track down a cause of some of our issues with some 
> exceptions leaving Storm workers in a zombified state for some time. I 
> believe I've isolated the bug to the behaviour in 
> :report-error-and-die/reportErrorAndDie in the executor. Essentially:
> {code}
> :report-error-and-die (fn [error]
>                         (try
>                           ((:report-error <>) error)
>                           (catch Exception e
>                             (log-message "Error while reporting error to cluster, proceeding with shutdown")))
>                         (if (or
>                               (exception-cause? InterruptedException error)
>                               (exception-cause? java.io.InterruptedIOException error))
>                           (log-message "Got interrupted excpetion shutting thread down...")
>                           ((:suicide-fn <>))))
> {code}
> has the grouping for the if statement slightly wrong. It shouldn't log OR die 
> from InterruptedException/InterruptedIOException, but it should log under 
> that condition, and ALWAYS die. 
> Basically:
> {code}
> :report-error-and-die (fn [error]
>                         (try
>                           ((:report-error <>) error)
>                           (catch Exception e
>                             (log-message "Error while reporting error to cluster, proceeding with shutdown")))
>                         (if (or
>                               (exception-cause? InterruptedException error)
>                               (exception-cause? java.io.InterruptedIOException error))
>                           (log-message "Got interrupted excpetion shutting thread down..."))
>                         ((:suicide-fn <>)))
> {code}
> After digging into the Java port of this code, it looks like a different bug 
> was introduced while porting:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>         || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>     LOG.info("Got interrupted exception shutting thread down...");
>     suicideFn.run();
> }
> {code}
> Was how this was initially ported, and STORM-2142 changed this to:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
> || 
> 

[jira] [Commented] (STORM-2194) ReportErrorAndDie doesn't always die

2016-11-10 Thread Jungtaek Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653457#comment-15653457
 ] 

Jungtaek Lim commented on STORM-2194:
-

Could you attach a thread dump from when you got the zombie Storm worker? That 
behaviour was introduced in STORM-773, but we haven't received a similar report.

> ReportErrorAndDie doesn't always die
> 
>
> Key: STORM-2194
> URL: https://issues.apache.org/jira/browse/STORM-2194
> Project: Apache Storm
>  Issue Type: Bug
>  Components: storm-core
>Affects Versions: 2.0.0, 1.0.2
>Reporter: Craig Hawco
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've been trying to track down a cause of some of our issues with some 
> exceptions leaving Storm workers in a zombified state for some time. I 
> believe I've isolated the bug to the behaviour in 
> :report-error-and-die/reportErrorAndDie in the executor. Essentially:
> {code}
> :report-error-and-die (fn [error]
>                         (try
>                           ((:report-error <>) error)
>                           (catch Exception e
>                             (log-message "Error while reporting error to cluster, proceeding with shutdown")))
>                         (if (or
>                               (exception-cause? InterruptedException error)
>                               (exception-cause? java.io.InterruptedIOException error))
>                           (log-message "Got interrupted excpetion shutting thread down...")
>                           ((:suicide-fn <>))))
> {code}
> has the grouping for the if statement slightly wrong. It shouldn't log OR die 
> from InterruptedException/InterruptedIOException, but it should log under 
> that condition, and ALWAYS die. 
> Basically:
> {code}
> :report-error-and-die (fn [error]
>                         (try
>                           ((:report-error <>) error)
>                           (catch Exception e
>                             (log-message "Error while reporting error to cluster, proceeding with shutdown")))
>                         (if (or
>                               (exception-cause? InterruptedException error)
>                               (exception-cause? java.io.InterruptedIOException error))
>                           (log-message "Got interrupted excpetion shutting thread down..."))
>                         ((:suicide-fn <>)))
> {code}
> After digging into the Java port of this code, it looks like a different bug 
> was introduced while porting:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>         || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>     LOG.info("Got interrupted exception shutting thread down...");
>     suicideFn.run();
> }
> {code}
> Was how this was initially ported, and STORM-2142 changed this to:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>         || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>     LOG.info("Got interrupted exception shutting thread down...");
> } else {
>     suicideFn.run();
> }
> {code}
> However, I believe the correct port is as described above:
> {code}
> if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
>         || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
>     LOG.info("Got interrupted exception shutting thread down...");
> }
> suicideFn.run();
> {code}
> I'll look into providing patches for the 1.x and 2.x branches shortly.

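The three Java variants quoted in the description differ only in when suicideFn runs. The self-contained sketch below models just that control flow so the difference can be checked directly. The class, enum, and helper names are hypothetical, logging is left out, and the chain walk is a simplified stand-in for Utils.exceptionCauseIsInstanceOf.

```java
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the three variants; only the control flow is kept.
final class ReportErrorAndDieSketch {

    enum Variant { INITIAL_PORT, AFTER_STORM_2142, PROPOSED }

    // Simplified stand-in for the two exceptionCauseIsInstanceOf checks.
    static boolean interruptedInChain(Throwable t) {
        for (Throwable current = t; current != null; current = current.getCause()) {
            if (current instanceof InterruptedException
                    || current instanceof InterruptedIOException) {
                return true;
            }
        }
        return false;
    }

    static void reportErrorAndDie(Variant variant, Throwable e, Runnable suicideFn) {
        boolean interrupted = interruptedInChain(e);
        switch (variant) {
            case INITIAL_PORT:
                // dies only when an interrupt is found in the chain
                if (interrupted) {
                    suicideFn.run();
                }
                break;
            case AFTER_STORM_2142:
                // dies only when no interrupt is found in the chain
                if (!interrupted) {
                    suicideFn.run();
                }
                break;
            case PROPOSED:
                // always dies; the interrupt check would only affect logging
                suicideFn.run();
                break;
        }
    }

    public static void main(String[] args) {
        Throwable interrupted = new RuntimeException(new InterruptedException());
        Throwable other = new IllegalStateException("boom");
        for (Variant variant : Variant.values()) {
            AtomicInteger deaths = new AtomicInteger();
            reportErrorAndDie(variant, interrupted, deaths::incrementAndGet);
            reportErrorAndDie(variant, other, deaths::incrementAndGet);
            System.out.println(variant + " died " + deaths.get() + " of 2 times");
        }
    }
}
```

Only the PROPOSED variant runs suicideFn for both kinds of error, which matches the "ALWAYS die" behaviour argued for in the description.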

