[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534073#comment-16534073
 ] 

ASF GitHub Bot commented on FLINK-9681:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/6255


> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533846#comment-16533846
 ] 

ASF GitHub Bot commented on FLINK-9681:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/6255
  
Thanks for the update @hequn8128.

I'll merge this


> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533794#comment-16533794
 ] 

ASF GitHub Bot commented on FLINK-9681:
---

Github user hequn8128 commented on the issue:

https://github.com/apache/flink/pull/6255
  
@fhueske Comments have been addressed. Thanks for your review.


> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533618#comment-16533618
 ] 

ASF GitHub Bot commented on FLINK-9681:
---

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/6255#discussion_r200330683
  
--- Diff: docs/dev/table/streaming.md ---
@@ -591,16 +589,14 @@ qConfig.withIdleStateRetentionTime(Time.hours(12);
 
 val qConfig: StreamQueryConfig = ???
 
-// set idle state retention time: min = 12 hour, max = 16 hours
-qConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(16))
-// set idle state retention time. min = max = 12 hours
-qConfig.withIdleStateRetentionTime(Time.hours(12)
+// set idle state retention time: min = 12 hour, max = 24 hours
+qConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(24))
 
 {% endhighlight %}
 
 
 
-Configuring different minimum and maximum idle state retention times is 
more efficient because it reduces the internal book-keeping of a query for when 
to remove state.
+Configuring different minimum and maximum idle state retention times is 
more efficient because it reduces the internal book-keeping of a query for when 
to remove state. Difference between minTime and maxTime shoud be at least 5 
minutes.
--- End diff --

The "... more efficient ..." does not apply anymore. Maybe rephrase to 

> Cleaning up state requires additional bookkeeping which becomes less 
expensive for larger differences of `minTime` and `maxTime`. The difference 
between `minTime` and `maxTime` must be at least 5 minutes.




> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533619#comment-16533619
 ] 

ASF GitHub Bot commented on FLINK-9681:
---

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/6255#discussion_r200333630
  
--- Diff: 
flink-libraries/flink-table/src/test/scala/org/apache/flink/table/runtime/harness/HarnessTestBase.scala
 ---
@@ -384,4 +386,12 @@ object HarnessTestBase {
   value.row.getField(selectorField).asInstanceOf[T]
 }
   }
+
+  /**
+* Test class used to test min and max retention time.
+*/
+  class StreamQueryConfigTest(min: Time, max: Time) extends 
StreamQueryConfig {
--- End diff --

I would rename the class to `TestStreamQueryConfig` because the `Test` at 
the end suggests that this class is testing something instead of being a util 
for a test.


> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532876#comment-16532876
 ] 

ASF GitHub Bot commented on FLINK-9681:
---

GitHub user hequn8128 opened a pull request:

https://github.com/apache/flink/pull/6255

[FLINK-9681] [table] Make sure difference between minRetentionTime and 
maxRetentionTime at least 5 minutes


## What is the purpose of the change

This PR aims to make sure difference between minRetentionTime and 
maxRetentionTime at least 5 minutes. Currently, for a group by(or other 
operators), if minRetentionTime equals to maxRetentionTime, the group by 
operator will register a timer for each record coming at different time which 
cause performance problem. The reasoning for having two parameters is that we 
can avoid to register many timers if we have more freedom when to discard 
state. As min equals to max cause performance problem it is better to make sure 
these two parameters are not same.


## Brief change log

  - Throw exception when difference between minRetentionTime and 
maxRetentionTime smaller than  5 minutes
  - Add a `QueryConfigTest` class extends from StreamQueryConfig. 
`QueryConfigTest` don't have the 5min limitation which makes test more 
convenient.
  - Adapt tests.


## Verifying this change

This change added tests and can be verified as follows:
  - Added test that validates that min and max retention time are set 
correctly.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hequn8128/flink retentionTime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6255.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6255


commit b41df60dde422bce9317479c8d77304d6ee4857b
Author: hequn8128 
Date:   2018-07-04T14:11:38Z

[FLINK-9681] [table] Make sure difference between minRetentionTime and 
maxRetentionTime at least 5 minutes




> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-03 Thread Hequn Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531322#comment-16531322
 ] 

Hequn Cheng commented on FLINK-9681:


Hi [~fhueske], thanks for your reply and suggestions. I will create a pull 
request laster. 

> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9681) Make sure minRetentionTime not equal to maxRetentionTime

2018-07-02 Thread Fabian Hueske (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529850#comment-16529850
 ] 

Fabian Hueske commented on FLINK-9681:
--

Hi [~hequn8128],

You are right. Setting both parameters to the same value results in very bad 
performance. 
I think we should throw an exception when setting the values in the query 
configuration and point out that there should be a significant difference 
between min and max (at least 1 minute?)

In the long term we can migrate state cleanup to the currently developed state 
TTL and might get away from min and max.

> Make sure minRetentionTime not equal to maxRetentionTime
> 
>
> Key: FLINK-9681
> URL: https://issues.apache.org/jira/browse/FLINK-9681
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: Hequn Cheng
>Assignee: Hequn Cheng
>Priority: Major
>
> Currently, for a group by(or other operators), if minRetentionTime equals to 
> maxRetentionTime, the group by operator will register a timer for each record 
> coming at different time which cause performance problem. The reasoning for 
> having two parameters is that we can avoid to register many timers if we have 
> more freedom when to discard state. As min equals to max cause performance 
> problem it is better to make sure these two parameters are not same.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)