[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986220#comment-13986220 ] Tathagata Das commented on SPARK-1645:
--
This makes sense from the integration point of view. Though I wonder, from the POV of Flume's deployment configuration, does it make things more complex? For example, if someone already has a Flume system set up, in the current situation the configuration change to add a new sink seems standard and easy. However, in the proposed model, since Flume's data-pushing node has to run a sink, how much more complicated does this configuration process get?

Improve Spark Streaming compatibility with Flume
Key: SPARK-1645
URL: https://issues.apache.org/jira/browse/SPARK-1645
Project: Spark
Issue Type: Bug
Components: Streaming
Reporter: Hari Shreedharan

Currently the following issues affect Spark Streaming and Flume compatibility:
* If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this.
* The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case.
* Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate jira for this and we should work on it there.

This is a longer-term project and requires considerable development work. I intend to start working on these soon. Any input is appreciated. (It'd be great if someone can add me as a contributor on jira, so I can assign the jira to myself.)
--
This message was sent by Atlassian JIRA (v6.2#6252)
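To make the deployment question above concrete: assuming the new sink is packaged as an ordinary Flume sink plugin (a jar dropped on the agent's classpath), the operator-visible change in flume.conf would plausibly be just swapping the sink type and bind address. The custom class name below is illustrative, not an actual artifact; everything else is standard Flume sink configuration.

```
# Current setup: Avro sink pushing to the Spark receiver's host:port
agent1.sinks.spark.type = avro
agent1.sinks.spark.hostname = spark-worker-1
agent1.sinks.spark.port = 41414
agent1.sinks.spark.channel = ch1

# Proposed setup: custom pull-style sink that the Spark receiver polls;
# the sink binds locally and waits, instead of pushing to a worker.
# (class name is hypothetical; the sink jar would go on the Flume classpath)
agent1.sinks.spark.type = org.example.flume.SparkPollableSink
agent1.sinks.spark.hostname = 0.0.0.0
agent1.sinks.spark.port = 41414
agent1.sinks.spark.channel = ch1
```

Under that assumption, the operator no longer has to track which worker node hosts the receiver, since the bind address is the Flume agent itself.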
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986275#comment-13986275 ] Hari Shreedharan commented on SPARK-1645:
-
Yes, it is better to add new methods rather than reusing the old ones and confusing existing users. In fact, I think we should add a new receiver for the time being and only deprecate the old one initially. We can remove the old one in a later release.
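The deprecation strategy discussed above can be sketched as a pair of factory methods; the class and method names here are illustrative, not the actual Spark API, and the string return values stand in for real streams.

```java
/**
 * Sketch of the proposed API evolution: keep the existing push-based
 * factory but mark it deprecated, and add a separate pull-based factory,
 * so existing users are not broken and can migrate at their own pace.
 * All names are hypothetical.
 */
final class FlumeStreamFactories {

    /** Old behavior: Flume pushes to a receiver listening on host:port. */
    @Deprecated
    static String createStream(String host, int port) {
        return "push-based receiver on " + host + ":" + port;
    }

    /** New behavior: the receiver polls a special sink running inside the Flume agent. */
    static String createPollingStream(String sinkHost, int sinkPort) {
        return "pull-based receiver polling " + sinkHost + ":" + sinkPort;
    }
}
```

Removing `createStream` only in a later release gives downstream code a full release cycle of deprecation warnings before the break.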
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986312#comment-13986312 ] Tathagata Das commented on SPARK-1645:
--
I agree. A new receiver for the new API is the best way.
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984734#comment-13984734 ] Hari Shreedharan commented on SPARK-1645:
-
Yes, I have a rough design for that in mind. The idea is to add a sink which plugs into Flume and gets polled by the Spark receiver. That way, even if the node on which the worker is running fails, the receiver on another node can poll the sink and pull data. From the Flume point of view, the sink does not conform to the definition of standard sinks (all Flume sinks are push only), but it can be written such that we don't lose data. Later, if/when Flume adds support for pollable sinks, this sink can be ported.
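The pollable-sink idea above can be modeled without any Flume dependencies: the sink buffers events, hands out a batch per poll, and only discards events once the batch is committed, so a failed receiver never loses data. This is a minimal single-threaded sketch with hypothetical names, not the proposed implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Simplified, Flume-free model of the proposed pollable sink: events sit
 * in a buffer (standing in for the Flume channel), polled batches stay
 * pending until ACKed, and a rollback returns them to the buffer.
 */
class PollableSinkModel {
    private final Queue<String> channel = new ArrayDeque<>();
    private final Map<Long, List<String>> pending = new ConcurrentHashMap<>();
    private final AtomicLong nextTxnId = new AtomicLong();

    void put(String event) { channel.add(event); }

    /** A receiver polls for up to maxEvents; the batch stays pending until committed. */
    long poll(int maxEvents, List<String> out) {
        for (int i = 0; i < maxEvents && !channel.isEmpty(); i++) {
            out.add(channel.poll());
        }
        long txnId = nextTxnId.incrementAndGet();
        pending.put(txnId, new ArrayList<>(out));
        return txnId;
    }

    /** Receiver stored the data: commit, so the events can be dropped for good. */
    void commit(long txnId) { pending.remove(txnId); }

    /** No ACK arrived (e.g. the receiver died): events become available again. */
    void rollback(long txnId) {
        List<String> events = pending.remove(txnId);
        if (events != null) channel.addAll(events);
    }

    int available() { return channel.size(); }
}
```

The key property, mirroring Flume's transactional channel semantics, is that a batch handed to a receiver is never gone until an explicit commit; a rollback restores it for the next poll, possibly from a different worker.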
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984768#comment-13984768 ] Tathagata Das commented on SPARK-1645:
--
Let me understand this. Is this sink going to run as a separate process outside the Spark executor? If it is running as a thread in the same executor process as the receiver, then that is no better than what we have now, as it will fail when the executor fails. So I am guessing it will be a process outside the executor. Doesn't that introduce the headache of managing that process separately? And what happens when the whole worker node dies?
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984816#comment-13984816 ] Hari Shreedharan commented on SPARK-1645:
-
No, the Flume source and sink reside within the same JVM (http://flume.apache.org/FlumeUserGuide.html#architecture). So the receiver polls the Flume sink running on a different node (the node that runs the Flume agent pushing the data). If the node running the receiver goes down, then another worker starts up and reads from the same Flume agent. If the Flume agent goes down, the receiver polls and fails to get data until the agent is back up.
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984927#comment-13984927 ] Tathagata Das commented on SPARK-1645:
--
Ah, I think I get it now. So instead of the default push-based model as it is now (where a sink is running with the receiver), you simply want to make it pull-based. So if the current situation is this !http://i.imgur.com/m8oiOwl.png?1! you propose this !http://i.imgur.com/N6Ee1cb.png?1! Right? Assuming that is right, it does make things very convenient for Spark Streaming's receivers. However, what does it mean for reliable receiving? When the receiver pulls the data from the source, will it acknowledge the source only once Spark acknowledges that it has reliably saved the data?
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984950#comment-13984950 ] Hari Shreedharan commented on SPARK-1645:
-
The first one is not exactly accurate, though it explains the idea. The second is what I suggest. As a first step, we do what is currently done - the receiver stores the data locally and acknowledges, so the reliability does not improve. Later we can make the improvement, for all receivers, that the data gets persisted all the way to the driver (add a new API like storeReliably or something). We would have to do a two-step Poll-ACK process. We can have the initial poll create a new request that is added to the ones pending commit in the sink. Once the receiver has written the data (for now in the current way, later reliably), it sends out an ACK for the request id, which causes the request to be committed, so Flume can remove the events. If the receiver does not send the ACK, a scheduled thread in the sink (with a timeout specified in the Flume config) can roll back the transaction and make the data available again (Flume already has the capability to make uncommitted txns available again if the agent fails).
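The timeout side of the two-step Poll-ACK process described above can be sketched in isolation: each issued batch carries a transaction id and a timestamp, an ACK within the timeout commits it, and a periodic sweep rolls back anything stale. This is a hedged model with invented names, not the proposed sink code; the time parameters are passed in explicitly to keep it testable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Model of ACK-timeout tracking for polled batches: transactions not
 * ACKed within timeoutMs are rolled back so their events become
 * available to the next poll, mirroring Flume's transaction rollback.
 * expire() would be driven by a scheduled thread in a real sink.
 */
class AckTracker {
    private final long timeoutMs;
    private final Map<Long, Long> issuedAt = new ConcurrentHashMap<>();
    private final Set<Long> rolledBack = ConcurrentHashMap.newKeySet();

    AckTracker(long timeoutMs) { this.timeoutMs = timeoutMs; }

    /** Record that a polled batch was handed out at nowMs. */
    void issued(long txnId, long nowMs) { issuedAt.put(txnId, nowMs); }

    /** Receiver ACKed: commit. Returns false if the txn already timed out. */
    boolean ack(long txnId) {
        return issuedAt.remove(txnId) != null && !rolledBack.contains(txnId);
    }

    /** Periodic sweep (e.g. from a ScheduledExecutorService): roll back stale txns. */
    List<Long> expire(long nowMs) {
        List<Long> expired = new ArrayList<>();
        issuedAt.forEach((txn, issuedMs) -> {
            if (nowMs - issuedMs >= timeoutMs && issuedAt.remove(txn) != null) {
                rolledBack.add(txn);   // a real sink would re-queue the events here
                expired.add(txn);
            }
        });
        return expired;
    }
}
```

A late ACK for a rolled-back transaction returns false, which matters because those events will be redelivered; the receiver side therefore has to tolerate duplicates, which is the usual cost of at-least-once delivery.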
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983971#comment-13983971 ] Hari Shreedharan commented on SPARK-1645:
-
Yep, that is correct. I'd like to contribute to the design as much as possible - so perhaps we can work on the design document together. Once we start looking into this, we will likely have to proceed on multiple fronts so we can get more of these features committed faster.