[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-10-08 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558550#comment-15558550 ]

Cody Koeninger commented on SPARK-3146:
---

SPARK-4964 (the direct stream) added a messageHandler.
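
For reference, in the Spark 1.x direct stream API (spark-streaming-kafka), the
messageHandler is the final argument to KafkaUtils.createDirectStream. A minimal
sketch, assuming an existing StreamingContext ssc, a kafkaParams map, and an
illustrative topic and starting offsets:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Keep topic/partition/offset alongside the decoded payload instead of
    // discarding them, which is what this ticket asked for.
    case class Record(topic: String, partition: Int, offset: Long, value: String)

    val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)  // illustrative

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, Record](
      ssc,          // assumed StreamingContext
      kafkaParams,  // assumed Map[String, String] of Kafka configuration
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) =>
        Record(mmd.topic, mmd.partition, mmd.offset, mmd.message()))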


> Improve the flexibility of Spark Streaming Kafka API to offer user the
> ability to process message before storing into BM
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Saisai Shao
>
> Currently the Spark Streaming Kafka API stores only the key and value of
> each message into BM (the BlockManager) for processing, which potentially
> loses the flexibility needed for different requirements:
> 1. Currently the topic/partition/offset information for each message is
> discarded by KafkaInputDStream. In some scenarios people may need this
> information to better filter the messages, as SPARK-2388 describes.
> 2. People may need to add a timestamp to each message when it is fed into
> Spark Streaming, to better measure system latency.
> 3. Checkpointing the partition/offsets, or others...
> So here we add a messageHandler to the interface to give people the
> flexibility to preprocess each message before storing it into BM. In the
> meantime, this improvement keeps compatibility with the current API.






[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2016-02-09 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139250#comment-15139250 ]

Cody Koeninger commented on SPARK-3146:
---

I think this can be safely closed, given the messageHandler in the direct stream.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-24 Thread Tathagata Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258593#comment-14258593 ]

Tathagata Das commented on SPARK-3146:
--

I am a little wary of adding any functionality to a single input stream that we
would like to be added to all input streams in a general way in the future. It
leads to a lot of API mess. However, it is also true that I don't see (yet) an
obvious way to add the general interceptor (M => Iterable[T]) pattern to all
receiver input streams. So I am conflicted on this. I am going to create a JIRA
for this feature, on which we as a community can brainstorm different ways to
build it.






[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-24 Thread Tathagata Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258598#comment-14258598 ]

Tathagata Das commented on SPARK-3146:
--

[~c...@koeninger.org] [~jerryshao] If either of you wants to take a crack at
the general interceptor pattern for all receivers, here is the JIRA:
SPARK-4960.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-19 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254100#comment-14254100 ]

Cody Koeninger commented on SPARK-3146:
---

This is a real problem for production use, not an abstract concern about what
users might or might not do if you gave them full access to power that they
already should have had.

The patch is a working implementation that solves the problem; we've been using
it in production.

I think the patch needs to be improved to take a function MessageAndMetadata =>
Iterable[R] instead of the current MessageAndMetadata => R (in other words, it
needs to be a flatMap operation instead of the current map; both shapes are
sketched below).

If for some reason this needs to be added to all receivers, and it needs to be
done via a setter instead of a constructor argument, I don't understand why,
but it will solve the problem, so I'm all for it.

If it really needs to be a function (Any, Any) => Iterable[R], I think that's
worse from a user-interface point of view, but again it will solve the problem,
so I'm all for it.

If someone with the power to merge PRs will say which of those options they
would be willing to accept, I'll happily implement it.

What I definitely don't want to see is this ticket languishing for another
6 months, during which time I keep having to do manual maintenance of a patched
Spark.
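
A sketch of the two handler shapes being compared (names and the topic filter
are illustrative, not from the actual patch):

    import kafka.message.MessageAndMetadata

    // map-shaped handler: exactly one output record per Kafka message
    val mapHandler: MessageAndMetadata[String, String] => (String, Long, String) =
      mmd => (mmd.topic, mmd.offset, mmd.message())

    // flatMap-shaped handler: zero or more output records per message, so
    // the same hook can also filter (empty result) or expand a message
    val flatMapHandler: MessageAndMetadata[String, String] => Iterable[(String, Long, String)] =
      mmd =>
        if (mmd.topic.startsWith("metrics")) Seq((mmd.topic, mmd.offset, mmd.message()))
        else Seq.empty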




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-19 Thread Hari Shreedharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254123#comment-14254123 ]

Hari Shreedharan commented on SPARK-3146:
-

For now, I am OK with just adding it to individual DStreams, not necessarily
everything, though that would be the best case. Unfortunately, that would
break the current interface (unless, of course, we add an implementation),
and since each event is received by an individual receiver, using this means
the receiver would need to call a method like intercept or something. Either
way, I am also a strong +1 on getting this in soon!




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Hari Shreedharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252113#comment-14252113 ]

Hari Shreedharan commented on SPARK-3146:
-

I am a +1 for using something similar to the interceptor pattern. I have worked
on Flume enough to know that it is an incredibly useful feature, especially for
filtering and enrichment, which can usually be pretty lightweight. If there is
enough demand, I can help with this; perhaps implement it along the lines of
Flume's Interceptor implementation?




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252146#comment-14252146 ]

Cody Koeninger commented on SPARK-3146:
---

From my point of view, the interceptor pattern is little more than an overly
complex way of spelling flatMap. What does it actually gain you?

The specific issue here is twofold:

1. Exposing MessageAndMetadata to users, not just (key, message). This is a
Kafka-specific concern, not applicable to all types of receivers. If it had
been originally written as a DStream[MessageAndMetadata], there would be less
of a need for this patch.

2. Controlling timing, and thus checkpointing, based on the contents of the
individual item. I can see this being applicable to many types of receivers,
but not all. For instance, if I sleep inside this function, what happens for
Kafka is not the same as what happens for, e.g., Twitter.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Hari Shreedharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252160#comment-14252160 ]

Hari Shreedharan commented on SPARK-3146:
-

The difference between the interceptor pattern and flatMap would be latency.
Rather than have the operation run on the Spark cluster every batch, the
operation would happen when the data is received, and thus any intercepted
messages would be changed or filtered before the batch is generated. Sufficient
checks would need to be in place to ensure that users do not end up putting all
their code in an intercept call.

I am not exactly sure what (1) buys you, but (2) seems to be easily possible
via the interceptor pattern, no?




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252210#comment-14252210 ]

Cody Koeninger commented on SPARK-3146:
---

(1) is important because MessageAndMetadata contains more information than
just a key and value, namely the topic, partition, and offset. There's no
reason to hide that information from users, and exposing it lets them fix all
kinds of problems for themselves: SPARK-2388, manual offset management, and
things I can't imagine yet.

(2) Regarding the interceptor, maybe this is just a semantic difference. The
function you're talking about _is_ flatMap, namely
(Container[A], (A) => Container[B]) : Container[B]

The important thing is, as you note, that it runs before storage and offset
checkpointing happen. I'd find it unfortunate if the API were exposed in terms
of an Interceptor interface that users had to implement and then use a setter
for, rather than simply passing a function to the constructor... but
ultimately, what matters is that the functionality is exposed.
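
For reference, the generic shape named above, written out in Scala
(illustrative only):

    // flatMap over some container F: given an F[A] and a function A => F[B],
    // produce an F[B]; the message handler plays the role of f here
    trait HasFlatMap[F[_]] {
      def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]
    }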




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Hari Shreedharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252256#comment-14252256 ]

Hari Shreedharan commented on SPARK-3146:
-

I understand what you mean by flatMap (the reason I don't want to call it
flatMap is that we already have a flatMap, which behaves differently in terms
of when and where it is executed).

Interceptors can simply be functions passed in (the reason it is not a
function in Flume is that Flume is all Java; functions could not be passed
around until Java 8), the idea being the same. I think we both agree on the
idea; we can discuss how it is exposed in more detail once we have more
consensus on whether it makes sense.

For (1), I see that the metadata is something that might be helpful. What if
we pass the metadata to the interceptor method too (see how the BlockGenerator
gets it now)? That is better than rewriting it as a
DStream[MessageAndMetadata], which may not be as easy to get done.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252299#comment-14252299 ]

Cody Koeninger commented on SPARK-3146:
---

Yes, for the specific case of Kafka, regardless of the name of the API, the
user-supplied function needs to be of type

MessageAndMetadata => Iterable[R]








[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Hari Shreedharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252319#comment-14252319 ]

Hari Shreedharan commented on SPARK-3146:
-

If you look at the addDataWithCallback method, we pass (data, metadata) to the
BlockGenerator, where data is the (K, V) pair and metadata is
(topicAndPartition, msgAndMetadata.offset). Would passing (data, metadata) to
an intercept method be useful?




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252397#comment-14252397 ]

Cody Koeninger commented on SPARK-3146:
---

((K, V), (topicAndPartition, offset)) is all of the public information in a
MessageAndMetadata object, aside from the decoders, so that's probably OK. If
you're trying to come up with a uniform interface, though,
(Any, Any) => Iterable[R] doesn't seem much better than Any => Iterable[R]
(the casting this implies is sketched below).

Also, just to be clear, the particular example calls you're talking about only
exist in ReliableKafkaReceiver, not KafkaInputDStream, and we need something
that works for both.
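
A sketch of the downcasting the fully uniform shape would force on a Kafka
user, assuming the (data, metadata) types described above (hypothetical, for
illustration only):

    import kafka.common.TopicAndPartition

    // Hypothetical uniform interceptor: the receiver hands over (data,
    // metadata) as Any, so user code must downcast both before doing
    // anything useful.
    val uniform: (Any, Any) => Iterable[(String, Int, Long, String)] = {
      (data, metadata) =>
        val (_, value) = data.asInstanceOf[(String, String)]
        val (tp, offset) = metadata.asInstanceOf[(TopicAndPartition, Long)]
        Seq((tp.topic, tp.partition, offset, value))
    }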




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-12-18 Thread Saisai Shao (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252742#comment-14252742 ]

Saisai Shao commented on SPARK-3146:


Hi all, thanks a lot for your comments. My original purpose with this proposal
was to solve SPARK-2388 and to make the solution more general. I'm not sure
whether other receivers have such requirements; also, is it reasonable to add
this interceptor to receivers like socket or file?

We can further improve the flexibility of the streaming API with these
interceptors, for example by filtering out unwanted messages, or by tracking
latency by timestamping each message once it is received (a small sketch
follows below). But the problem is that such flexibility may lead users to
abuse this API to process data as soon as it is received, rather than through
transformation functions, which potentially disobeys the semantics of Spark
Streaming.

I think we should probably redesign the whole thing to make it both flexible
and meaningful.
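
As a concrete example of the latency use case, a receive-side handler could be
as simple as (illustrative only):

    import kafka.message.MessageAndMetadata

    // Tag each record with a wall-clock receive timestamp so downstream
    // stages can measure end-to-end latency.
    val timestamped: MessageAndMetadata[String, String] => (Long, String) =
      mmd => (System.currentTimeMillis(), mmd.message())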





[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-11-03 Thread Cody Koeninger (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195146#comment-14195146 ]

Cody Koeninger commented on SPARK-3146:
---

I think this PR is an elegant way to solve SPARK-2388, which is otherwise a
blocking bug for our usage of Kafka.

Absent a concrete design for doing something equivalent for all InputDStreams,
I'd encourage merging it.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-09-02 Thread Saisai Shao (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119128#comment-14119128 ]

Saisai Shao commented on SPARK-3146:


Hi [~tdas],

Sorry for the late response, and thanks a lot for your reply. My initial
thought was to add flexibility to Spark Streaming's Kafka API, because in some
cases people need to know the topic of a message, among other things. Maybe
it's worth changing this to a generic mechanism for all the input dstreams. I
will think carefully about the implementation.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-08-26 Thread Tathagata Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111614#comment-14111614 ]

Tathagata Das commented on SPARK-3146:
--

I think this can be further generalized to the interceptor pattern of Flume,
which is applicable to all types of receivers. So I feel this should be
implemented in a way that applies generally to all input dstreams at the API
level. For example:

 InputDStream.setInterceptorFunction(function: (T) => Iterable[T])

Then all receivers should be able to apply a filter function.

[~jerryshao] What do you think?
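
A rough sketch of what such a hook could look like (hypothetical API; this was
never merged, and SPARK-4960 tracked the follow-up):

    import scala.reflect.ClassTag
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.ReceiverInputDStream

    // Hypothetical: a receiver input stream carrying an interceptor that is
    // applied to each record before it is stored in the BlockManager.
    abstract class InterceptingInputDStream[T: ClassTag](ssc: StreamingContext)
        extends ReceiverInputDStream[T](ssc) {

      @volatile private var interceptor: T => Iterable[T] = (t: T) => Iterable(t)

      // the proposed setter; defaults to the identity interceptor
      def setInterceptorFunction(function: T => Iterable[T]): Unit = {
        interceptor = function
      }

      // receivers would run each record through this before storing results
      protected def intercept(record: T): Iterable[T] = interceptor(record)
    }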




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-08-20 Thread Saisai Shao (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103486#comment-14103486 ]

Saisai Shao commented on SPARK-3146:


This issue can actually solve the problem mentioned in SPARK-2388, besides
offering a more general mechanism.




[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2014-08-20 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103520#comment-14103520 ]

Apache Spark commented on SPARK-3146:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/2053
