RE: Pull Request : https://github.com/apache/beam/pull/6540

I have been doing some work on a generalized set of timeseries transforms, with 
the goal to abstract the user from the process of dealing with some of the 
common problems when working with timeseries in BEAM batch /  stream mode. 
Would love to get feedback, comments, ideas and I hope, after things flesh out 
more, collaborators! Of course it will not cover all issues in the timeseries 
problem space, but from many interactions and discussions over the last couple 
of years, I feel it has the potential to help with a large enough set of use 
cases to make it worthwhile endeavor. 

Primary goals:
Remove as much "boilerplate" as possible form common timeseries pre-processing 
tasks.
Deal with a couple of the harder problems with timeseries when processed as a 
stream in a distributed system. Some example use cases (which we use state api 
and timers to solve):
IOT : A device sends signals when something changes but nothing if there has 
been no update to save battery. The absence of data downstream does not mean 
that there is no information, it's just not been observed. (Of course it could 
be the IOT device went boom.. but in the absence of new data, the last known 
value is assumed until some ttl is reached).
Finance :  Ticks in fx finance data will come with Ask and Bid prices as they 
change, if however no ASK or BID price is seen the last known value is assumed.
Provide some common sinks as reference, for example output of Tensorflow 
Sequence Examples onto storage systems. The initial sinks in the pull requests 
are based on Google Cloud sinks, but this should be expanded to other platforms 
I hope with the help of some of the good folks on this thread! 

In order to make this a tractable problem, there are some fundamental 
assumptions that have been made. 

The raw timeseries data will translate to a common representation. The first 
pass of this is below. Users main 'coding task' will be to convert their 
objects to :

Single property
https://github.com/rezarokni/beam/blob/timeseries/sdks/java/extensions/timeseries/src/main/proto/TimeSeriesData.proto#L66

Multivariate: 
https://github.com/rezarokni/beam/blob/timeseries/sdks/java/extensions/timeseries/src/main/proto/TimeSeriesData.proto#L75

The primary utility of this library is for stream processing. While it will 
work fine in batch mode there are many already established tools for dealing 
with timeseries data that has already landed in a data store. 
This library is not intended as a data analytics tool, although the output of 
the library has potential to be very useful within analytics tools it is a side 
benefit.

Would be great to get feedback and if you are interested in helping more 
directly please ping.

Cheers

Reza

Reply via email to