[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986220#comment-13986220 ] Tathagata Das commented on SPARK-1645:
--
This makes sense from the integration point of view. Though I wonder, from the POV of Flume's deployment configuration, does it make things more complex? For example, if someone already has a Flume system set up, in the current situation the configuration change to add a new sink seems standard and easy. However, in the proposed model, since Flume's data-pushing node has to run a sink, how much more complicated does this configuration process get?

Improve Spark Streaming compatibility with Flume
Key: SPARK-1645
URL: https://issues.apache.org/jira/browse/SPARK-1645
Project: Spark
Issue Type: Bug
Components: Streaming
Reporter: Hari Shreedharan

Currently the following issues affect Spark Streaming and Flume compatibility:
* If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this.
* The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case.
* Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate jira for this and we should work on it there.

This is a longer-term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone can add me as a contributor on jira, so I can assign the jira to myself.)
--
This message was sent by Atlassian JIRA (v6.2#6252)
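To make the deployment question above concrete: assuming the new sink is packaged as an ordinary Flume sink plugin (a jar dropped on the agent's classpath), the operator-visible change in flume.conf would plausibly be just swapping the sink type and bind address. The custom class name below is illustrative, not an actual artifact; everything else is standard Flume sink configuration.

```
# Current setup: Avro sink pushing to the Spark receiver's host:port
agent1.sinks.spark.type = avro
agent1.sinks.spark.hostname = spark-worker-1
agent1.sinks.spark.port = 41414
agent1.sinks.spark.channel = ch1

# Proposed setup: custom pull-style sink that the Spark receiver polls;
# the sink binds locally and waits, instead of pushing to a worker.
# (class name is hypothetical; the sink jar would go on the Flume classpath)
agent1.sinks.spark.type = org.example.flume.SparkPollableSink
agent1.sinks.spark.hostname = 0.0.0.0
agent1.sinks.spark.port = 41414
agent1.sinks.spark.channel = ch1
```

Under that assumption, the operator no longer has to track which worker node hosts the receiver, since the bind address is the Flume agent itself.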
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986275#comment-13986275 ] Hari Shreedharan commented on SPARK-1645:
-
Yes, it is better to add new methods rather than reusing the old ones and confusing existing users. In fact, I think we should add a new receiver for the time being and only deprecate the old one initially. We can remove the old one in a later release.
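The deprecation strategy discussed above can be sketched as a pair of factory methods; the class and method names here are illustrative, not the actual Spark API, and the string return values stand in for real streams.

```java
/**
 * Sketch of the proposed API evolution: keep the existing push-based
 * factory but mark it deprecated, and add a separate pull-based factory,
 * so existing users are not broken and can migrate at their own pace.
 * All names are hypothetical.
 */
final class FlumeStreamFactories {

    /** Old behavior: Flume pushes to a receiver listening on host:port. */
    @Deprecated
    static String createStream(String host, int port) {
        return "push-based receiver on " + host + ":" + port;
    }

    /** New behavior: the receiver polls a special sink running inside the Flume agent. */
    static String createPollingStream(String sinkHost, int sinkPort) {
        return "pull-based receiver polling " + sinkHost + ":" + sinkPort;
    }
}
```

Removing `createStream` only in a later release gives downstream code a full release cycle of deprecation warnings before the break.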
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986312#comment-13986312 ] Tathagata Das commented on SPARK-1645:
--
I agree. A new receiver for the new API is the best way.
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984734#comment-13984734 ] Hari Shreedharan commented on SPARK-1645:
-
Yes, I have a rough design for that in mind. The idea is to add a sink which plugs into Flume and gets polled by the Spark receiver. That way, even if the node on which the worker is running fails, the receiver on another node can poll the sink and pull data. From the Flume point of view, the sink does not conform to the definition of standard sinks (all Flume sinks are push only), but it can be written such that we don't lose data. Later, if/when Flume adds support for pollable sinks, this sink can be ported.
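The pollable-sink idea above can be modeled without any Flume dependencies: the sink buffers events, hands out a batch per poll, and only discards events once the batch is committed, so a failed receiver never loses data. This is a minimal single-threaded sketch with hypothetical names, not the proposed implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Simplified, Flume-free model of the proposed pollable sink: events sit
 * in a buffer (standing in for the Flume channel), polled batches stay
 * pending until ACKed, and a rollback returns them to the buffer.
 */
class PollableSinkModel {
    private final Queue<String> channel = new ArrayDeque<>();
    private final Map<Long, List<String>> pending = new ConcurrentHashMap<>();
    private final AtomicLong nextTxnId = new AtomicLong();

    void put(String event) { channel.add(event); }

    /** A receiver polls for up to maxEvents; the batch stays pending until committed. */
    long poll(int maxEvents, List<String> out) {
        for (int i = 0; i < maxEvents && !channel.isEmpty(); i++) {
            out.add(channel.poll());
        }
        long txnId = nextTxnId.incrementAndGet();
        pending.put(txnId, new ArrayList<>(out));
        return txnId;
    }

    /** Receiver stored the data: commit, so the events can be dropped for good. */
    void commit(long txnId) { pending.remove(txnId); }

    /** No ACK arrived (e.g. the receiver died): events become available again. */
    void rollback(long txnId) {
        List<String> events = pending.remove(txnId);
        if (events != null) channel.addAll(events);
    }

    int available() { return channel.size(); }
}
```

The key property, mirroring Flume's transactional channel semantics, is that a batch handed to a receiver is never gone until an explicit commit; a rollback restores it for the next poll, possibly from a different worker.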
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984768#comment-13984768 ] Tathagata Das commented on SPARK-1645:
--
Let me understand this. Is this sink going to run as a separate process outside the Spark executor? If it is running as a thread in the same executor process as the receiver, then that is no better than what we have now, as it will fail when the executor fails. So I am guessing it will be a process outside the executor. Doesn't that introduce the headache of managing that process separately? And what happens when the whole worker node dies?
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984816#comment-13984816 ] Hari Shreedharan commented on SPARK-1645:
-
No, the Flume source and sink reside within the same JVM (http://flume.apache.org/FlumeUserGuide.html#architecture). So the receiver polls the Flume sink running on a different node (the node that runs the Flume agent pushing the data). If the node running the receiver goes down, then another worker starts up and reads from the same Flume agent. If the Flume agent goes down, the receiver polls and fails to get data until the agent is back up.
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984927#comment-13984927 ] Tathagata Das commented on SPARK-1645:
--
Ah, I think I get it now. So instead of the default push-based model as it is now (where a sink is running with the receiver), you simply want to make it pull-based. So if the current situation is this !http://i.imgur.com/m8oiOwl.png?1! you propose this !http://i.imgur.com/N6Ee1cb.png?1! Right? Assuming that is right, it does make things very convenient for Spark Streaming's receivers. However, what does it mean for reliable receiving? When the receiver pulls the data from the source, will it acknowledge the source only once Spark acknowledges that it has reliably saved the data?
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984950#comment-13984950 ] Hari Shreedharan commented on SPARK-1645:
-
The first one is not exactly accurate, though it explains the idea. The second is what I suggest. As a first step, we do what is currently done - the receiver stores the data locally and acknowledges, so the reliability does not improve. Later we can make the improvement, for all receivers, that the data gets persisted all the way to the driver (add a new API like storeReliably or something). We would have to do a two-step Poll-ACK process. We can have the initial poll create a new request that is added to the ones pending commit in the sink. Once the receiver has written the data (for now in the current way, later reliably), it sends out an ACK for the request id, which causes the request to be committed, so Flume can remove the events. If the receiver does not send the ACK, a scheduled thread in the sink (with a timeout specified in the Flume config) can roll back the transaction and make the data available again (Flume already has the capability to make uncommitted txns available again if the agent fails).
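The timeout side of the two-step Poll-ACK process described above can be sketched in isolation: each issued batch carries a transaction id and a timestamp, an ACK within the timeout commits it, and a periodic sweep rolls back anything stale. This is a hedged model with invented names, not the proposed sink code; the time parameters are passed in explicitly to keep it testable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Model of ACK-timeout tracking for polled batches: transactions not
 * ACKed within timeoutMs are rolled back so their events become
 * available to the next poll, mirroring Flume's transaction rollback.
 * expire() would be driven by a scheduled thread in a real sink.
 */
class AckTracker {
    private final long timeoutMs;
    private final Map<Long, Long> issuedAt = new ConcurrentHashMap<>();
    private final Set<Long> rolledBack = ConcurrentHashMap.newKeySet();

    AckTracker(long timeoutMs) { this.timeoutMs = timeoutMs; }

    /** Record that a polled batch was handed out at nowMs. */
    void issued(long txnId, long nowMs) { issuedAt.put(txnId, nowMs); }

    /** Receiver ACKed: commit. Returns false if the txn already timed out. */
    boolean ack(long txnId) {
        return issuedAt.remove(txnId) != null && !rolledBack.contains(txnId);
    }

    /** Periodic sweep (e.g. from a ScheduledExecutorService): roll back stale txns. */
    List<Long> expire(long nowMs) {
        List<Long> expired = new ArrayList<>();
        issuedAt.forEach((txn, issuedMs) -> {
            if (nowMs - issuedMs >= timeoutMs && issuedAt.remove(txn) != null) {
                rolledBack.add(txn);   // a real sink would re-queue the events here
                expired.add(txn);
            }
        });
        return expired;
    }
}
```

A late ACK for a rolled-back transaction returns false, which matters because those events will be redelivered; the receiver side therefore has to tolerate duplicates, which is the usual cost of at-least-once delivery.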
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983971#comment-13983971 ] Hari Shreedharan commented on SPARK-1645:
-
Yep, that is correct. I'd like to contribute to the design as much as possible - so perhaps we can work on the design document together. Once we start looking into this, we will likely have to proceed on multiple fronts so we can get more of these features committed faster.