[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573459#comment-15573459 ]

Michael Armbrust edited comment on SPARK-17812 at 10/13/16 10:53 PM:

+1 to the suggested ways of subscribing, and for using "assign" as a familiar name.

I would probably leave it with a single option like this:

{code}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1": 4567}}""")
{code}

Where you can give -1 or -2 (again following kafka) for specific partitions. {{startingTime}} could be added when we support time indexes.

was (Author: marmbrus):
+1 to the suggested was of subscribing, and for using "assign" as a familiar name.

I would probably leave it with a single option like this:

{code}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1": 4567}}""")
{code}

Where you can give -1 or -2 (again following kafka) for specific partitions. {{startingTime}} could be added when we support time indexes.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the earliest or latest offsets available at the moment the query is started. Sometimes this is a lot of data. It would be nice to be able to do the following:
> - seek to user specified offsets for manually specified topicpartitions

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573459#comment-15573459 ]

Michael Armbrust edited comment on SPARK-17812 at 10/13/16 10:53 PM:

+1 to the suggested ways of subscribing, and for using "assign" as a familiar name.

I would probably leave it with a single option like this:

{code}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1": 4567}}""")
{code}

Where you can give -1 or -2 (again following kafka) for specific partitions. {{startingTime}} could be added when we support time indexes.

was (Author: marmbrus):
+1 to the suggested ways of subscribing, and for using "assign" as a familiar name.

I would probably leave it with a single option like this:

{{.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1": 4567\}\}"""}}
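
A rough illustration of how a single startingOffsets value of the shape proposed above might be interpreted. This is only a sketch in plain Python, not Spark code: the helper name parse_starting_offsets and its return shape are made up for illustration, and the sentinel meanings (-2 for earliest, -1 for latest) are assumed from Kafka's convention.

```python
import json

# Sentinels assumed from Kafka's convention: -2 means earliest, -1 means latest.
EARLIEST, LATEST = -2, -1


def parse_starting_offsets(value):
    """Hypothetical helper: normalize a startingOffsets option value.

    Returns (keyword, {}) for "earliest" / "latest", or
    ("specific", {(topic, partition): offset}) for a JSON string.
    """
    text = value.strip()
    if text in ("earliest", "latest"):
        return text, {}
    offsets = {}
    # json.loads raises json.JSONDecodeError on malformed input.
    for topic, partitions in json.loads(text).items():
        for partition, offset in partitions.items():
            if offset < EARLIEST:
                raise ValueError(f"invalid offset {offset} for {topic}-{partition}")
            offsets[(topic, int(partition))] = offset
    return "specific", offsets
```

Note that the JSON has to use a colon between each partition and its offset; a comma in that position makes the value unparseable.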
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573432#comment-15573432 ]

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 10:44 PM:

If you're seriously worried that people are going to get confused,

{noformat}
.option("defaultOffsets", "earliest" | "latest")
.option("specificOffsets", """{"topicFoo": {"0": 1234, "1": 4567}}""")
{noformat}

let those two at least not be mutually exclusive, and punt on the question of precedence until there's an actual startingTime or startingX ticket.

was (Author: c...@koeninger.org):
If you're seriously worried that people are going to get confused,

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1": 4567}}""")
.option("defaultOffsets", "earliest" | "latest")
{noformat}

let those two at least not be mutually exclusive, and punt on the question of precedence until there's an actual startingTime or startingX ticket.
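
One way the two options could coexist, sketched in plain Python rather than Spark code. The function resolve_start_positions is hypothetical, and the sketch assumes that an explicit per-partition offset wins over the general default, which is the reading argued for elsewhere in this thread.

```python
import json


def resolve_start_positions(assigned, default_offsets, specific_offsets=None):
    """Hypothetical helper: pick a start position for every assigned partition.

    assigned         -- iterable of (topic, partition) pairs
    default_offsets  -- "earliest" or "latest", applied to every partition first
    specific_offsets -- optional JSON string {"topic": {"partition": offset}}
    """
    if default_offsets not in ("earliest", "latest"):
        raise ValueError("defaultOffsets must be 'earliest' or 'latest'")
    # Step 1: every partition starts from the general default.
    positions = {tp: default_offsets for tp in assigned}
    # Step 2: explicit per-partition offsets override the default.
    for topic, partitions in json.loads(specific_offsets or "{}").items():
        for partition, offset in partitions.items():
            tp = (topic, int(partition))
            if tp not in positions:
                raise ValueError(f"{tp} is not an assigned partition")
            positions[tp] = offset
    return positions
```

With three partitions of topicFoo assigned, a default of "latest", and specific offsets for partitions 0 and 1, only partition 2 falls back to the default.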
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573395#comment-15573395 ]

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 10:25 PM:

1. We don't have lists, we have strings. Regexes and valid topic names have overlaps (dot is the obvious one).

2. Mapping directly to kafka method names means we don't have to come up with some other (weird and possibly overlapping) name when they add more ways to subscribe, we just use theirs.

3. I think this "starting X messages" is a mess with kafka semantics for the reasons both you and I have already expressed. At any rate, I think Michael already clearly punted the "starting X messages" case to a different ticket.

4. I think it's more than sufficiently clear as suggested; no one is going to expect that a specific offset they provided is going to be overruled by a general single default. The implementation is also crystal clear: seek to the position identified by startingTime, then seek to any specific offsets for specific partitions.

Yes, this is all bikeshedding, but it's bikeshedding that directly affects what people are actually able to do with the api. Needlessly restricting it for reasons that have nothing to do with safety is just going to piss users off for no reason. Just because you don't have a use case that needs it doesn't mean you should arbitrarily prevent users from doing it.

Please, just choose something and let me build it so that people can actually use the thing by the next release.

was (Author: c...@koeninger.org):
1. We don't have lists, we have strings. Regexes and valid topic names have overlaps (dot is the obvious one).

2. Mapping directly to kafka method names means we don't have to come up with some other (weird and possibly overlapping) name when they add more ways to subscribe, we just use theirs.

3. I think this is a mess with kafka semantics for the reasons both you and I have already expressed. At any rate, I think Michael already clearly punted the "starting X" case to a different topic.

4. I think it's more than sufficiently clear as suggested; no one is going to expect that a specific offset they provided is going to be overruled by a general single default. The implementation is also crystal clear: seek to the position identified by startingTime, then seek to any specific offsets for specific partitions.

Yes, this is all bikeshedding, but it's bikeshedding that directly affects what people are actually able to do with the api. Needlessly restricting it for reasons that have nothing to do with safety is just going to piss users off for no reason. Just because you don't have a use case that needs it doesn't mean you should arbitrarily prevent users from doing it.

Please, just choose something and let me build it so that people can actually use the thing by the next release.
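
To make point 1 concrete, here is a small plain-Python sketch of the overlap between regexes and topic names. The topic names and both helper functions are invented for illustration; the only point is that the same string selects different topics depending on whether it is read as a literal name or as a pattern.

```python
import re

# Made-up topic names for illustration.
topics = ["topic.foo", "topicXfoo", "topic.foobar"]


def match_as_pattern(value, names):
    """Treat value as a regex that must match the whole topic name."""
    return [n for n in names if re.fullmatch(value, n)]


def match_as_literal(value, names):
    """Treat value as an exact topic name."""
    return [n for n in names if n == value]


# "topic.foo" is a legal topic name, but as a regex the dot matches any character.
as_pattern = match_as_pattern("topic.foo", topics)
as_literal = match_as_literal("topic.foo", topics)
```

Here as_literal contains only "topic.foo", while as_pattern also picks up "topicXfoo", so a single string option cannot safely guess which meaning the user intended.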
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573166#comment-15573166 ]

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 9:17 PM:

Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

2 non-mutually exclusive ways of specifying starting position, explicit startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1": 4567}}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}

where long is a timestamp, work to be done on that later. Note that even kafka 0.8 has a (really crappy, based on log file modification time) api for time, so pursuing timestamps via startingTime later doesn't necessarily exclude it.

was (Author: c...@koeninger.org):
Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

2 non-mutually exclusive ways of specifying starting position, explicit startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1": 4567}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}

where long is a timestamp, work to be done on that later. Note that even kafka 0.8 has a (really crappy, based on log file modification time) api for time, so pursuing timestamps via startingTime later doesn't necessarily exclude it.
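
The "3 mutually exclusive ways of subscribing" rule could be checked along these lines. This is a plain-Python sketch, not the Spark implementation: parse_subscription is a hypothetical helper, and the options dict simply stands in for whatever was passed through .option(...).

```python
import json
import re

SUBSCRIBE_OPTIONS = ("subscribe", "subscribePattern", "assign")


def parse_subscription(options):
    """Hypothetical helper: require exactly one subscription option and parse it."""
    given = [name for name in SUBSCRIBE_OPTIONS if name in options]
    if len(given) != 1:
        raise ValueError(f"exactly one of {SUBSCRIBE_OPTIONS} is required, got {given}")
    name = given[0]
    value = options[name]
    if name == "subscribe":
        # Comma-separated list of literal topic names.
        return name, [t.strip() for t in value.split(",") if t.strip()]
    if name == "subscribePattern":
        # A regex over topic names.
        return name, re.compile(value)
    # assign: JSON mapping each topic to a list of partition numbers, no offsets.
    assigned = json.loads(value)
    return name, sorted((topic, p) for topic, parts in assigned.items() for p in parts)
```

Keeping assign free of inline offsets means the starting position can be validated separately, against whichever partitions this step produced.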
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922 ]

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 8:33 PM:

Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're dealing with a stream that has multiple topics in it. Agree we should break that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is orthogonal to specifying some offsets, you have a situation like this:

{noformat}
.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }}""")
{noformat}

where startingOffsetForRealzYo has a more reasonable name that conveys it is specifying starting offsets, yet is not confusingly similar to startingOffset.

Trying to hack it all into one json as an alternative, with a "default" topic, means you're going to have to pick a key that isn't a valid topic, or add yet another layer of indirection. It also makes it harder to make the format consistent with SPARK-17829 (which seems like a good thing to keep consistent, I agree).

Obviously I think you should just change the name, but it's your show.

was (Author: c...@koeninger.org):
Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're dealing with a stream that has multiple topics in it. Agree we should break that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is orthogonal to specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo has a more reasonable name that conveys it is specifying starting offsets, yet is not confusingly similar to startingOffset.

Trying to hack it all into one json as an alternative, with a "default" topic, means you're going to have to pick a key that isn't a valid topic, or add yet another layer of indirection. It also makes it harder to make the format consistent with SPARK-17829 (which seems like a good thing to keep consistent, I agree).

Obviously I think you should just change the name, but it's your show.
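
The problem with a "default" key inside a single JSON value can be shown with a short plain-Python sketch. Two things here are assumptions rather than anything stated in the thread: the simplified topic-name rule (letters, digits, dot, underscore, hyphen, up to 249 characters) and the shape of the combined JSON, which is invented purely to illustrate the collision.

```python
import json
import re

# Assumed, simplified topic-name rule: letters, digits, '.', '_', '-', 1 to 249 chars.
LEGAL_TOPIC = re.compile(r"[a-zA-Z0-9._-]{1,249}")


def is_legal_topic(name):
    """Return True if name could be a real topic under the assumed rule."""
    return LEGAL_TOPIC.fullmatch(name) is not None


# Hypothetical combined format: one reserved key carries the general default.
combined = json.loads('{"default": "latest", "topicfoo": {"0": 1234, "1": 4567}}')

# Every key that could also be a real topic name is ambiguous as a reserved word.
ambiguous_keys = [key for key in combined if is_legal_topic(key)]
```

Because "default" is itself a legal topic name under this rule, it shows up in ambiguous_keys. A reserved key would have to use a character outside the legal set (for example "*"), or the format would need an extra level of nesting.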