[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150154#comment-16150154 ] Bowen Li commented on FLINK-7223: - [~StephanEwen] We've already filed such a request to AWS through our company. Well, it's not easy and very practical to ask Amazon make such internal changes - I guess such changes will require AWS to re-design a lot of stuff. Just look at how slow they are responding to a simple KinesisProducer issues on https://github.com/awslabs/amazon-kinesis-producer/issues ... > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125845#comment-16125845 ] Stephan Ewen commented on FLINK-7223: - [~phoenixjiangnan] Do you think it is worthwhile filing an issue at Amazon for that? It seems crazy that the number of describe requests is so low and across all apps of one account. > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119222#comment-16119222 ] Bowen Li commented on FLINK-7223: - I'm actually fine with the default value, since we are already aware of this issue, and we override the default value ourself. But from a enterprise user's point of view, the assumption of 'this means running up to 50 Flink jobs per account' is not practical at all for a big AWS enterprise customer. Here's why: In an enterprise AWS account, lots of other services are contributing to saturating the 10requests/sec throughput. In our (OfferUp, inc) prod AWS account, we are hitting that cap even before having Flink. Having more and more Flink jobs makes things worse, and breaks other services. Our Flink jobs are not more important than other services, so we can't either allocate resources solely for Flink or compete with other services. Thus we set Flink's discovery interval to be {{1 hour}}. It's probably not a topic worths a long discussion :) If you guys feel the default value is ok, I'll good with it > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107888#comment-16107888 ] Stephan Ewen commented on FLINK-7223: - Hmmm... Is there a way to auto-configure this value as in the following way: - The Flink job would theoretically do discovery once per 5 seconds (this means running up to 50 Flink jobs er account) - The interval in which the TaskManagers can discover is than {{5 seconds x source parallelism}} > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106325#comment-16106325 ] Bowen Li commented on FLINK-7223: - What do you guys think 3min or 1min? I strongly believe that this default value should be the most practical one so it won't cause much trouble to users, rather than the theoretically shortest one but rarely works in practice > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094173#comment-16094173 ] Bowen Li commented on FLINK-7223: - [~sthm] can we also get your insights here? > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093498#comment-16093498 ] Bowen Li commented on FLINK-7223: - I believe what Gordon and I proposed are complementary. We need to do both. We can create another ticket for what Gordon proposed for code change, and link it to this ticket. 10sec is simply too short and intensive. What do you think about 3min or 1min? AWS takes about 5 minutes to update kinesics shard itself, so users are of right expectation that Flink takes the same amount of time. > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093375#comment-16093375 ] Tzu-Li (Gordon) Tai commented on FLINK-7223: This config affects the responsiveness to new shards of the stream. I agree that 5 minutes is a bit too much by default. One other thing: the AWS limitation is "This operation has a limit of 10 transactions per second per account." Therefore, IMO, even if we change this discovery interval to 5 minutes, you would still bump into the limitation if the source parallelism is high (each source instance performs discovery independently). We could consider: 1. Be a little less strict when handling exceptions due to the limitation cap 2. Perhaps add a little more jitter into the discovery interval to even out the requests across multiple subtasks. My consideration is that simply increasing the discovery interval would not really solve the problem of {{describeStream}} API rate limitations. > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7223) Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector
[ https://issues.apache.org/jira/browse/FLINK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093334#comment-16093334 ] Stephan Ewen commented on FLINK-7223: - [~tzulitai] what do you think about that? What is affected by that? The responsiveness to appearance of new shards (elasticity of Kinesis), or the responsiveness to new streams? If the former was affected, then 5 minutes can be a lot. For the later I would think it is not too bad. > Increase DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS for Flink-kinesis-connector > > > Key: FLINK-7223 > URL: https://issues.apache.org/jira/browse/FLINK-7223 > Project: Flink > Issue Type: Improvement > Components: Kinesis Connector >Affects Versions: 1.3.0 >Reporter: Bowen Li >Assignee: Bowen Li >Priority: Minor > Fix For: 1.4.0 > > > Background: {{DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS}} in > {{org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants}} > is the default value for Flink to call Kinesis's {{describeStream()}} API. > Problem: Right now, its value is 10,000millis (10sec), which is too short. We > ran into problems that Flink-kinesis-connector's call of {{describeStream()}} > exceeds Kinesis rate limit, and broken Flink taskmanager. > According to > http://docs.aws.amazon.com/kinesis/latest/APIReference/API_DescribeStream.html, > > "This operation has a limit of 10 transactions per second per account.". What > it means is that the 10transaction/account is a limit on a single > organization's AWS account..:( We contacted AWS Support, and confirmed > this. If you have more applications (either other Flink apps or non-Flink > apps) competing aggressively with your Flink app on this API, your Flink app > breaks. > I propose increasing the value DEFAULT_SHARD_DISCOVERY_INTERVAL_MILLIS from > 10,000millis(10sec) to preferably 300,000 (5min). Or at least 60,000 (1min) > if anyone has a solid reason arguing that 5min is too long, > This is also related to https://issues.apache.org/jira/browse/FLINK-6365 -- This message was sent by Atlassian JIRA (v6.4.14#64029)