[jira] [Commented] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623126#comment-15623126
 ] 

ASF GitHub Bot commented on APEXMALHAR-2244:


Github user asfgit closed the pull request at:

https://github.com/apache/apex-malhar/pull/434


> Optimize WindowedStorage and Spillable data structures for time series
> --
>
> Key: APEXMALHAR-2244
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2244
> Project: Apache Apex Malhar
> Issue Type: Sub-task
> Reporter: David Yan
> Assignee: Siyuan Hua
> Fix For: 3.6.0
>
>
> The spillable data structures currently do not make any assumptions about
> the key that is used in Managed State. As a result, they use
> ManagedStateImpl to interface with Managed State, with time buckets that
> are based on the Apex window id. But for WindowedStorage used by
> WindowedOperator, the key to the storage is a window, which is event-time
> based. Using the default ManagedStateImpl would be very inefficient for
> event-time-based keys, since it would write data that belongs to the same
> window to different time buckets.
> At a high level, the following roughly summarizes what needs to be done:
> 1. A way to tell the spillable data structures to use
> ManagedTimeUnifiedStateImpl.
> 2. A way to tell the spillable data structures how to extract the timestamp
> from the key. Note that in the case of WindowedOperator, the timestamp should
> be the end timestamp of the window (beginTimeMillis + durationMillis), not
> the begin timestamp.
> 3. A way to tell the spillable data structures how to assign the time bucket
> given that timestamp.
> 4. Following from point 3, the spillable implementations of WindowedStorage
> will need to take a config parameter that specifies the span of each time
> bucket in milliseconds.
> 5. Purge a time bucket only when all keys that belong to it have been
> removed and the Apex window id of the first window in which all of those
> keys were removed has been committed.
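
Points 2 through 5 of the description can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (TimeBucketSketch, BucketPurgeTracker, endTimestamp, timeBucket, canPurge) are hypothetical and are not Malhar APIs, buckets are assumed to be fixed-width, and Apex window ids are assumed to increase over time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helpers; the names are illustrative, not Malhar APIs. */
final class TimeBucketSketch {
  private TimeBucketSketch() {}

  /** Point 2: a window's timestamp is its end, not its beginning. */
  static long endTimestamp(long beginTimeMillis, long durationMillis) {
    return beginTimeMillis + durationMillis;
  }

  /** Points 3 and 4: fixed-width buckets whose span is a config value. */
  static long timeBucket(long timestampMillis, long bucketSpanMillis) {
    if (bucketSpanMillis <= 0) {
      throw new IllegalArgumentException("bucketSpanMillis must be positive");
    }
    // floorDiv keeps negative timestamps in the correct bucket.
    return Math.floorDiv(timestampMillis, bucketSpanMillis);
  }
}

/** Point 5 (hypothetical): tracks when a bucket becomes safe to purge. */
final class BucketPurgeTracker {
  private final Map<Long, Integer> liveKeys = new HashMap<>();
  private final Map<Long, Long> emptiedAtWindowId = new HashMap<>();

  void keyAdded(long bucket) {
    liveKeys.merge(bucket, 1, Integer::sum);
    // The bucket is in use again, so any earlier "emptied" mark is stale.
    emptiedAtWindowId.remove(bucket);
  }

  void keyRemoved(long bucket, long apexWindowId) {
    Integer count = liveKeys.get(bucket);
    if (count == null) {
      return; // nothing tracked for this bucket
    }
    if (count > 1) {
      liveKeys.put(bucket, count - 1);
    } else {
      liveKeys.remove(bucket);
      // Remember the window in which the last key was removed.
      emptiedAtWindowId.put(bucket, apexWindowId);
    }
  }

  /** True only if the bucket is empty and the emptying window is committed. */
  boolean canPurge(long bucket, long committedWindowId) {
    Long emptiedAt = emptiedAtWindowId.get(bucket);
    return emptiedAt != null && emptiedAt <= committedWindowId;
  }
}
```

With a bucket span of 60000 ms, a window that begins at 0 and lasts 90000 ms has an end timestamp of 90000 and therefore falls into bucket 1, regardless of the Apex window in which its data arrives.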



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-09-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533445#comment-15533445
 ] 

ASF GitHub Bot commented on APEXMALHAR-2244:


GitHub user siyuanh opened a pull request:

https://github.com/apache/apex-malhar/pull/434

APEXMALHAR-2244 Use TimeUnifiedManageStateStore for Spillable Data Structure

@davidyan74  Very first version just for SpillableMapImpl, please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/siyuanh/apex-malhar timeseries

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/434.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #434


commit 7dda03ac058e05e133b962b46da3228558d5ff8f
Author: Siyuan Hua 
Date:   2016-09-29T17:32:13Z

First commit






[jira] [Commented] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-09-22 Thread Siyuan Hua (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513770#comment-15513770
 ] 

Siyuan Hua commented on APEXMALHAR-2244:


Each spillable DS implementation uses a SpillableStateStore to store its data. 
We can make ManagedTimeUnifiedStateImpl implement that store interface as well, 
and have it take a time-extraction function to get or calculate the time and 
time bucket from each V/KV entry. The store can then be set up by the 
WindowedOperator, correct? 
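
A minimal sketch of that idea, assuming an in-memory stand-in for the store: the caller (for example the WindowedOperator) supplies a time-extraction function, and the store derives the bucket from the key on both put and get, so a lookup consults exactly one bucket. WindowKey and TimeAwareStoreSketch are hypothetical names for illustration only; the real SpillableStateStore and ManagedTimeUnifiedStateImpl interfaces are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical window key with the two fields named in the issue. */
final class WindowKey {
  final long beginTimeMillis;
  final long durationMillis;

  WindowKey(long beginTimeMillis, long durationMillis) {
    this.beginTimeMillis = beginTimeMillis;
    this.durationMillis = durationMillis;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof WindowKey)) {
      return false;
    }
    WindowKey other = (WindowKey) o;
    return beginTimeMillis == other.beginTimeMillis
        && durationMillis == other.durationMillis;
  }

  @Override
  public int hashCode() {
    return Long.hashCode(beginTimeMillis) * 31 + Long.hashCode(durationMillis);
  }
}

/** Hypothetical in-memory stand-in for a time-aware state store. */
final class TimeAwareStoreSketch<K, V> {
  private final ToLongFunction<K> timeExtractor;
  private final long bucketSpanMillis;
  private final Map<Long, Map<K, V>> buckets = new HashMap<>();

  TimeAwareStoreSketch(ToLongFunction<K> timeExtractor, long bucketSpanMillis) {
    if (bucketSpanMillis <= 0) {
      throw new IllegalArgumentException("bucketSpanMillis must be positive");
    }
    this.timeExtractor = timeExtractor;
    this.bucketSpanMillis = bucketSpanMillis;
  }

  /** The bucket is a pure function of the key, so put and get agree. */
  long bucketOf(K key) {
    return Math.floorDiv(timeExtractor.applyAsLong(key), bucketSpanMillis);
  }

  void put(K key, V value) {
    buckets.computeIfAbsent(bucketOf(key), b -> new HashMap<>()).put(key, value);
  }

  /** Looks only in the one bucket computed from the key. */
  V get(K key) {
    Map<K, V> bucket = buckets.get(bucketOf(key));
    return bucket == null ? null : bucket.get(key);
  }

  int bucketCount() {
    return buckets.size();
  }
}
```

For window keys, the extractor passed in would be something like `w -> w.beginTimeMillis + w.durationMillis`, which yields the end timestamp of the window.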



[jira] [Commented] (APEXMALHAR-2244) Optimize WindowedStorage and Spillable data structures for time series

2016-09-20 Thread Thomas Weise (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507450#comment-15507450
 ] 

Thomas Weise commented on APEXMALHAR-2244:
--

How will the spillable DS know which time bucket to look in for a given key 
(get)? It needs to know so that we don't look into multiple time buckets, since 
that is not necessary in this case and the number of time buckets can also be 
high.
