[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558550#comment-15558550 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

SPARK-4964 / the direct stream added a messageHandler.

> Improve the flexibility of Spark Streaming Kafka API to offer user the
> ability to process message before storing into BM
>
>              Key: SPARK-3146
>              URL: https://issues.apache.org/jira/browse/SPARK-3146
>          Project: Spark
>       Issue Type: Improvement
>       Components: Streaming
> Affects Versions: 1.0.2, 1.1.0
>         Reporter: Saisai Shao
>
> Currently the Spark Streaming Kafka API stores only the key and value of each
> message into the BlockManager (BM) for processing; this potentially loses
> flexibility for different requirements:
> 1. Topic/partition/offset information for each message is discarded by
> KafkaInputDStream. In some scenarios people may need this information to
> better filter messages, as SPARK-2388 describes.
> 2. People may need to add a timestamp to each message when feeding it into
> Spark Streaming, to better measure system latency.
> 3. Checkpointing the partition/offsets, or others...
> So here we add a messageHandler to the interface to give people the
> flexibility to preprocess a message before storing it into the BM. In the
> meantime this improvement keeps compatibility with the current API.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
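For reference, the direct stream's messageHandler is supplied when the stream is created. A minimal sketch against the Spark 1.3-era spark-streaming-kafka API; the broker address, topic name, and result tuple type are illustrative, not prescribed by the API:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("messageHandler-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// Assumed setup: a local broker and a starting offset for one partition.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)

// The handler runs per message and decides what ends up in the DStream;
// here we keep topic/partition/offset alongside the payload instead of
// discarding them as the receiver-based API does.
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) =>
    (mmd.topic, mmd.partition, mmd.offset, mmd.message()))
```

The handler's result type is the stream's element type, so exposing the metadata (or not) is entirely the caller's choice.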
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139250#comment-15139250 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

I think this can be safely closed, given the messageHandler in the direct stream.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258593#comment-14258593 ]

Tathagata Das commented on SPARK-3146:
--------------------------------------

I am a little wary of adding any functionality to a single input stream that we would like to add to all input streams in a general way in the future. It leads to a lot of API mess. However, it is also true that I don't see (yet) an obvious way to add the general interceptor (M => Iterable[T]) pattern to all receiver input streams. So I am conflicted on this. I am going to create a JIRA for this feature, on which we as a community can brainstorm different ways to create the feature.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258598#comment-14258598 ]

Tathagata Das commented on SPARK-3146:
--------------------------------------

[~c...@koeninger.org] [~jerryshao] If either of you wants to take a crack at the general interceptor pattern for all receivers, here is the JIRA - SPARK-4960
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254100#comment-14254100 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

This is a real problem for production use, not an abstract concern about what users might or might not do if you gave them full access to power they already should have had. The patch is a working implementation that solves the problem; we've been using it in production.

I think the patch needs to be improved to take a function MessageAndMetadata => Iterable[R] instead of the current MessageAndMetadata => R (in other words, it needs to be a flatMap operation instead of the current map).

If for some reason this needs to be added to all receivers, and it needs to be done via a setter instead of a constructor argument, I don't understand why, but it will solve the problem, so I'm all for it. If it really needs to be a function (Any, Any) => Iterable[R], I think that's worse from a user interface point of view, but again it will solve the problem, so I'm all for it.

If someone with the power to merge PRs will say which of those options they would be willing to accept, I'll happily implement it. What I definitely don't want to see is for this ticket to languish for another 6 months, during which time I keep having to do manual maintenance of a patched Spark.
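The map-versus-flatMap distinction being requested can be sketched in plain Scala; the case class below is a hypothetical stand-in for Kafka's MessageAndMetadata, not the real class:

```scala
// Hypothetical stand-in for kafka.message.MessageAndMetadata[K, V].
case class Msg(topic: String, partition: Int, offset: Long, value: String)

// Current patch: a map-style handler must produce exactly one record per message.
val mapHandler: Msg => String =
  m => s"${m.topic}-${m.partition}@${m.offset}: ${m.value}"

// Proposed: a flatMap-style handler can also drop a message (empty result)
// or expand one message into several records.
val flatMapHandler: Msg => Iterable[String] =
  m => if (m.value.isEmpty) Nil
       else Seq(s"${m.topic}-${m.partition}@${m.offset}: ${m.value}")

val msgs = Seq(Msg("t", 0, 1L, "a"), Msg("t", 0, 2L, ""), Msg("t", 0, 3L, "b"))
msgs.map(mapHandler).size         // 3: one output per input, always
msgs.flatMap(flatMapHandler).size // 2: the empty message is filtered out
```

The point of running this before storage is that filtering happens before the block (and its offsets) is checkpointed, which a downstream DStream.flatMap cannot do.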
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14254123#comment-14254123 ]

Hari Shreedharan commented on SPARK-3146:
-----------------------------------------

For now, I am ok with just adding it to individual DStreams, not necessarily everything - though that would be the best case. Unfortunately, that would break the current interface (unless of course we add an implementation) - and since each event is received by an individual receiver, using this means the receiver would need to call a method like intercept or something. Either way, I am also a strong +1 on getting this in soon!
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252113#comment-14252113 ]

Hari Shreedharan commented on SPARK-3146:
-----------------------------------------

I am a +1 for using something similar to the interceptor pattern. I have worked quite a bit on Flume to know that it is an incredibly useful feature, especially for filtering and enrichment, which can usually be pretty lightweight. If there is enough demand, I can help with this - perhaps implement it on the lines of Flume's Interceptor implementation?
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252146#comment-14252146 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

From my point of view, the interceptor pattern is little more than an overly complex way of spelling flatMap. What does it actually gain you?

The specific issue here is two-fold:

1. Exposing MessageAndMetadata to users, not just (key, message). This is a Kafka-specific concern, not applicable to all types of receivers. If it had been originally written as a DStream[MessageAndMetadata], there would be less of a need for this patch.

2. Controlling timing, and thus checkpointing, based on the contents of the individual item. I can see this being applicable to many types of receivers, but not all. For instance, if I sleep inside this function, what happens for Kafka is not the same as what happens for e.g. Twitter.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252160#comment-14252160 ]

Hari Shreedharan commented on SPARK-3146:
-----------------------------------------

The difference between the interceptor pattern and flatMap would be latency. Rather than have the operation run on the Spark cluster every batch, the operation would happen when the data is received, and thus any intercepted messages would change/be filtered before the batch is generated. Sufficient checks would need to be present to ensure that all user code does not end up in an intercept call. I am not exactly sure what (1) buys you, but (2) seems to be easily possible via the interceptor pattern, no?
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252210#comment-14252210 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

(1) is important because MessageAndMetadata contains more information than just a key and value, namely topic, partition and offset. There's no reason to hide that information from users, and exposing it lets them fix all kinds of problems for themselves, namely SPARK-2388, manual offset management, and things I can't imagine yet.

(2) Regarding the interceptor, maybe this is just a semantic difference. The function you're talking about _is_ flatMap, namely (Container[A], (A) => Container[B]): Container[B]. The important thing, as you note, is that it runs before storage and offset checkpointing happen. I'd find it unfortunate if the API was exposed in terms of an Interceptor interface that users had to implement and then use a setter for, rather than simply passing a function to the constructor... but ultimately, what matters is that the functionality is exposed.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252256#comment-14252256 ]

Hari Shreedharan commented on SPARK-3146:
-----------------------------------------

I understand what you mean by flatMap (the reason I don't want to call it flatMap is that we already have a flatMap which behaves differently in terms of when and where it is executed). Interceptors can simply be functions passed in (the reason it is not a function in Flume is that Flume is all Java - functions could not be passed around till Java 8); the idea is the same. I think we both agree on the idea; we can discuss how it is exposed in more detail once we have more consensus on whether it makes sense.

For (1), I see that the metadata is something that might be helpful. What if we pass the metadata to the interceptor method too (see how the BlockGenerator gets it now)? It is better than rewriting it as a DStream[MessageAndMetadata] - which may not be as easy to get done.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252299#comment-14252299 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

Yes, for the specific case of Kafka, regardless of the name of the API, the user-supplied function needs to be of type MessageAndMetadata => Iterable[R]
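One concrete shape such a function might take, for the SPARK-2388-style use case of filtering on topic while keeping offsets for manual management. Both case classes here are hypothetical stand-ins invented for illustration, not Spark or Kafka classes:

```scala
// Hypothetical stand-in for kafka.message.MessageAndMetadata[K, V].
case class Msg(topic: String, partition: Int, offset: Long, value: String)

// Hypothetical record type that keeps the offset for manual offset management.
case class Tagged(offset: Long, payload: String)

// MessageAndMetadata => Iterable[R]: drop messages from unwanted topics
// (returning Nil) and tag survivors with their offset.
val handler: Msg => Iterable[Tagged] = m =>
  if (m.topic.startsWith("metrics.")) Seq(Tagged(m.offset, m.value)) else Nil

handler(Msg("metrics.cpu", 0, 42L, "0.93")) // kept: one Tagged record
handler(Msg("debug.log", 0, 7L, "noise"))   // dropped: Nil
```

Because the single signature covers filtering, enrichment, and one-to-many expansion, it subsumes both the map-style messageHandler and the interceptor variants discussed above.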
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252319#comment-14252319 ]

Hari Shreedharan commented on SPARK-3146:
-----------------------------------------

If you look at the addDataWithCallback method, we pass (data, metadata) to the BlockGenerator, where data here is K,V and metadata is the (topicAndPartition, msgAndMetadata.offset). Would passing data, metadata to an intercept method be useful?
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252397#comment-14252397 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

((K, V), (topicAndPartition, offset)) is all of the public information in a MessageAndMetadata object, aside from the decoders, so that's probably ok. If you're trying to come up with a uniform interface, (Any, Any) => Iterable[R] doesn't seem much better than Any => Iterable[R], though.

Also, just to be clear, the particular example calls you're talking about only exist in ReliableKafkaReceiver, not KafkaInputDStream, and we need something that works for both.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252742#comment-14252742 ]

Saisai Shao commented on SPARK-3146:
------------------------------------

Hi all, thanks a lot for your comments. My original purpose with this proposal was to solve SPARK-2388 and make the solution more general. I'm not sure whether other receivers have such requirements; also, is it reasonable to add this interceptor to receivers like socket or file? We can further improve the flexibility of the streaming API with these interceptors, e.g. filtering out unwanted messages, or tracking latency by timestamping each message once it is received. But the problem is that such flexibility may lead users to abuse this API to process data as soon as it is received, rather than through transformation functions, which potentially disobeys the semantics of Spark Streaming. I think we should probably redesign the whole thing to make it both flexible and meaningful.
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195146#comment-14195146 ]

Cody Koeninger commented on SPARK-3146:
---------------------------------------

I think this PR is an elegant way to solve SPARK-2388, which is an otherwise blocking bug for our usage of Kafka. Absent a concrete design for doing something equivalent for all InputDStreams, I'd encourage merging it.
[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119128#comment-14119128 ] Saisai Shao commented on SPARK-3146: --- Hi [~tdas], sorry for the late response, and thanks a lot for your reply. My initial thought was to add flexibility to Spark Streaming's Kafka API, because in some cases people need to know the topic of a message, among other details. It may be worthwhile to change this to a generic mechanism for all input dstreams; I will think carefully about the implementation.
[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111614#comment-14111614 ] Tathagata Das commented on SPARK-3146: --- I think this can be further generalized to the interceptor pattern of Flume, which is applicable to all types of receivers. So I feel that this should be implemented in such a way that it generally applies to all input dstreams at the API level. For example: InputDStream.setInterceptorFunction(function: (T) => Iterable[T]). Then all receivers would be able to apply a filter function. [~jerryshao] What do you think?
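The interceptor idea above can be sketched outside of Spark: a per-record function of type `T => Iterable[T]` lets a receiver drop a record (empty result), keep it (singleton), or expand it, before anything is stored. This is a minimal illustration of the pattern only; `intercept` and the surrounding object are hypothetical names, not Spark's actual API.

```scala
// Minimal sketch of a Flume-style interceptor: each record is mapped to
// zero or more output records before storage. Names are illustrative.
object InterceptorSketch {
  // Apply the interceptor lazily across a stream of raw records.
  def intercept[T](records: Iterator[T], f: T => Iterable[T]): Iterator[T] =
    records.flatMap(f)

  def main(args: Array[String]): Unit = {
    val raw = Iterator("keep-1", "drop-2", "keep-3")
    // An interceptor acting as a filter: discard records marked "drop".
    val kept = intercept[String](raw,
      r => if (r.startsWith("keep")) Seq(r) else Seq.empty)
    println(kept.toList) // List(keep-1, keep-3)
  }
}
```

Because the result type is `Iterable[T]`, the same hook covers filtering, enrichment (one-to-one rewriting), and fan-out, which is what makes it strictly more general than a plain filter function.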
[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103486#comment-14103486 ] Saisai Shao commented on SPARK-3146: --- This issue can actually solve the problem mentioned in SPARK-2388, besides offering a more general way.
[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103520#comment-14103520 ] Apache Spark commented on SPARK-3146: --- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/2053
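The messageHandler idea behind the PR can also be sketched without Spark: instead of storing only `(key, value)`, a user-supplied function maps the full message metadata to an arbitrary record type `R`, so topic/partition/offset survive into the stream. The `MessageAndMetadata` case class below is a simplified stand-in for Kafka's class of the same name, not the real one.

```scala
// Stand-in for Kafka's MessageAndMetadata, carrying per-message metadata
// that the plain (key, value) stream would otherwise discard.
case class MessageAndMetadata[K, V](
  topic: String, partition: Int, offset: Long, key: K, message: V)

object MessageHandlerSketch {
  // A handler that keeps the offset alongside the payload, e.g. so the
  // application can checkpoint offsets itself.
  val handler: MessageAndMetadata[String, String] => (Long, String) =
    m => (m.offset, m.message)

  def main(args: Array[String]): Unit = {
    val m = MessageAndMetadata("logs", 0, 42L, "k", "payload")
    println(handler(m)) // (42,payload)
  }
}
```

This matches the shape of the `messageHandler: MessageAndMetadata[K, V] => R` parameter that the direct stream (SPARK-4964, mentioned in the later comments) eventually exposed: the handler runs once per message, before the record reaches the rest of the pipeline.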