[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986220#comment-13986220
 ] 

Tathagata Das commented on SPARK-1645:
--

This makes sense from the integration point of view. Though I wonder, from 
the POV of Flume's deployment configuration, does it make things more complex? 
For example, if someone already has a Flume system set up, then in the 
current situation the configuration change to add a new sink seems standard 
and easy. However, in the proposed model, since Flume's data-pushing node has 
to run a sink, how much more complicated does this configuration process get?
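
To make the question concrete, here is a minimal sketch of what the sink-side 
change might look like in flume.conf, assuming a hypothetical sink class 
(org.example.flume.SparkPollableSink) and hypothetical hostname/port 
properties; only the type and channel keys are standard Flume configuration:

{code}
# Sketch only: the sink class and its hostname/port properties are
# hypothetical. Custom Flume sinks define their own property names.
agent.sinks = sparkSink
agent.sinks.sparkSink.type = org.example.flume.SparkPollableSink
agent.sinks.sparkSink.hostname = 0.0.0.0
agent.sinks.sparkSink.port = 9999
agent.sinks.sparkSink.channel = memoryChannel
{code}

If the change really is just a handful of properties like this, it should be 
no harder than adding any other sink type.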

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * The receiver sends acks to Flume before the driver knows about the data. 
 The new receiver should also handle this case.
 * Data loss when the driver goes down - this is true for any streaming 
 ingest, not just Flume. I will file a separate JIRA for this and we should 
 work on it there. This is a longer-term project and requires considerable 
 development work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone could add me as a contributor on JIRA, so I can assign this 
 issue to myself.)





[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-30 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986275#comment-13986275
 ] 

Hari Shreedharan commented on SPARK-1645:
-

Yes, it is better to add new methods rather than reusing the old ones and 
confusing existing users.

In fact, I think we should add a new receiver for the time being and only 
deprecate the old one initially. We can remove the old one in a later release.
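
A hedged sketch of how the public API could evolve under that plan; the 
object and method names are illustrative, not a committed API:

{code:scala}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream

// Illustrative only: keep the existing push-based factory but mark it
// deprecated, and add a separate pull-based factory beside it, so the
// old one can be removed in a later release.
object FlumeUtilsSketch {
  @deprecated("Superseded by the pull-based polling receiver", "next release")
  def createStream(ssc: StreamingContext, hostname: String, port: Int)
      : ReceiverInputDStream[Array[Byte]] = ???

  def createPollingStream(ssc: StreamingContext, hostname: String, port: Int)
      : ReceiverInputDStream[Array[Byte]] = ???
}
{code}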

[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986312#comment-13986312
 ] 

Tathagata Das commented on SPARK-1645:
--

I agree. A new receiver for the new API is the best way.


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984734#comment-13984734
 ] 

Hari Shreedharan commented on SPARK-1645:
-

Yes, I have a rough design for that in mind. The idea is to add a sink which 
plugs into Flume and gets polled by the Spark receiver. That way, even if 
the node on which the worker is running fails, the receiver on another node can 
poll the sink and pull data. From the Flume point of view, the sink does not 
conform to the definition of standard sinks (all Flume sinks are push-only), 
but it can be written such that we don't lose data. Later, if/when Flume adds 
support for pollable sinks, this sink can be ported.
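
A minimal sketch of what such a sink could look like against the standard 
Flume sink API; the in-memory buffer and the poll() hook are hypothetical, 
and the RPC layer the receiver would call is omitted:

{code:scala}
import java.util.concurrent.ConcurrentLinkedQueue

import org.apache.flume.{Event, Transaction}
import org.apache.flume.Sink.Status
import org.apache.flume.sink.AbstractSink

// Sketch: process() drains the channel into an in-memory buffer that a
// remote Spark receiver polls. For simplicity this commits immediately,
// which could still lose buffered data if the sink dies; the poll/ACK
// variant discussed later in this thread holds the transaction open
// until the receiver acknowledges.
class SparkPollableSink extends AbstractSink {
  private val buffer = new ConcurrentLinkedQueue[Event]()

  override def process(): Status = {
    val channel = getChannel
    val tx: Transaction = channel.getTransaction
    tx.begin()
    try {
      val event = channel.take()
      if (event == null) { tx.commit(); Status.BACKOFF }
      else { buffer.add(event); tx.commit(); Status.READY }
    } catch {
      case t: Throwable => tx.rollback(); throw t
    } finally {
      tx.close()
    }
  }

  // Called by the (omitted) RPC layer when a receiver polls for a batch.
  def poll(max: Int): Seq[Event] =
    Iterator.continually(buffer.poll()).takeWhile(_ != null).take(max).toSeq
}
{code}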


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984768#comment-13984768
 ] 

Tathagata Das commented on SPARK-1645:
--

Let me understand this. Is this sink going to run as a separate process outside 
the Spark executor? If it is running as a thread in the same executor process 
as the receiver, then that is no better than what we have now, as it will fail 
when the executor fails. So I am guessing it will be a process outside the 
executor. Doesn't that introduce the headache of managing that process 
separately? And what happens when the whole worker node dies?


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984816#comment-13984816
 ] 

Hari Shreedharan commented on SPARK-1645:
-

No, the Flume source and sink reside within the same JVM 
(http://flume.apache.org/FlumeUserGuide.html#architecture). So the receiver 
polls the Flume sink running on a different node (the node that runs the Flume 
agent pushing the data). If the node running the receiver goes down, then 
another worker starts up and reads from the same Flume agent. If the Flume 
agent goes down, the receiver keeps polling and fails to get data until the 
agent is back up.
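
On the Spark side, a sketch of what the polling receiver might look like 
against the new receiver API; FlumePollingReceiverSketch and fetchBatch are 
illustrative names, and fetchBatch is a stub standing in for the RPC call to 
the sink:

{code:scala}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch: the receiver connects *out* to the sink on the Flume agent's
// node instead of binding a local port, so it can be restarted on any
// worker and still reach the same agent.
class FlumePollingReceiverSketch(agentHost: String, agentPort: Int)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_SER_2) {

  @volatile private var running = false

  override def onStart(): Unit = {
    running = true
    new Thread("flume-polling-thread") {
      override def run(): Unit = {
        while (running) {
          // Stub for the RPC poll against the sink on the agent node.
          val batch: Seq[Array[Byte]] = fetchBatch(agentHost, agentPort)
          batch.foreach(b => store(b)) // hand each event body to Spark
        }
      }
    }.start()
  }

  override def onStop(): Unit = { running = false }

  private def fetchBatch(host: String, port: Int): Seq[Array[Byte]] =
    Seq.empty // placeholder: a real version would poll the sink over RPC
}
{code}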


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984927#comment-13984927
 ] 

Tathagata Das commented on SPARK-1645:
--

Ah, I think I get it now. So instead of the default push-based model as it is 
now (where a sink is running with the receiver), you simply want to make it 
pull-based.

So if the current situation is this 

!http://i.imgur.com/m8oiOwl.png?1!  

you propose this  

!http://i.imgur.com/N6Ee1cb.png?1!

Right?
Assuming that is right, this does make it very convenient for Spark Streaming's 
receivers. However, what does it mean for reliable receiving? When the receiver 
pulls the data from the source, will it acknowledge the source only when 
Spark acknowledges that it has reliably saved the data?



[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-29 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984950#comment-13984950
 ] 

Hari Shreedharan commented on SPARK-1645:
-

The first one is not exactly accurate, though it explains the idea. The second 
is what I suggest.

As a first step, we do what is currently done - the receiver stores the data 
locally and acknowledges, so the reliability does not improve. Later we can 
make the improvement, for all receivers, that the data is persisted all the 
way to the driver (add a new API like storeReliably or something).

We would have to do a two-step poll-ACK process. The initial poll creates a 
new request, which is added to the ones pending commit in the sink. Once the 
receiver has written the data (for now in the current way, later reliably), 
it sends an ACK for the request id, which causes the request to be committed 
so that Flume can remove the events. If the receiver does not send the ACK, 
the sink can have a scheduled thread (with a timeout specified in the Flume 
config) roll back and make the data available again (Flume already has the 
capability to make uncommitted transactions available again if the agent 
fails).
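
A sketch of the sink-side bookkeeping this implies; all names are 
hypothetical, and a real implementation would also have to respect that Flume 
channel transactions are thread-bound, so each staged transaction would need 
to be driven by the thread that opened it:

{code:scala}
import java.util.UUID
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

import org.apache.flume.{Event, Transaction}

// Sketch of the two-step poll/ACK protocol: a poll stages an open channel
// transaction under a request id, an ACK commits it so Flume can drop the
// events, and a timer rolls back any transaction whose ACK never arrives.
class PendingRequests(ackTimeoutSecs: Long) {
  private case class Staged(tx: Transaction, events: Seq[Event])
  private val pending = new ConcurrentHashMap[String, Staged]()
  private val reaper = Executors.newSingleThreadScheduledExecutor()

  // Poll: the events were taken off the channel, but the transaction is
  // left open so nothing is lost if the receiver never acknowledges.
  def stage(tx: Transaction, events: Seq[Event]): String = {
    val requestId = UUID.randomUUID().toString
    pending.put(requestId, Staged(tx, events))
    reaper.schedule(new Runnable {
      override def run(): Unit = rollback(requestId) // no ACK in time
    }, ackTimeoutSecs, TimeUnit.SECONDS)
    requestId
  }

  // ACK from the receiver: commit, so Flume can remove the events.
  def ack(requestId: String): Unit =
    Option(pending.remove(requestId)).foreach { s => s.tx.commit(); s.tx.close() }

  // Timeout or explicit NACK: roll back, making the data available again.
  def rollback(requestId: String): Unit =
    Option(pending.remove(requestId)).foreach { s => s.tx.rollback(); s.tx.close() }
}
{code}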


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-28 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983971#comment-13983971
 ] 

Hari Shreedharan commented on SPARK-1645:
-

Yep, that is correct. I'd like to contribute to the design as much as possible, 
so perhaps we can work on the design document together. Once we start looking 
into this, we will likely have to proceed on multiple fronts so we can get 
more of these features committed faster.
