Re: Feature Generation for Large datasets composed of many time series

2017-07-24 Thread Lukasz Cwik
The more ids the better as this increases the parallelism in your pipeline. Also, the aggregations that you listed like min/max/average are very efficient operations to perform on datasets. Cassandra is already supported: https://github.com/apache/beam/tree/master/sdks/java/io/cassandra Using a

Re: Feature Generation for Large datasets composed of many time series

2017-07-24 Thread julio . cesare
OK, thanks! That's exactly the kind of thing I was imagining with Apache Beam. I still have a few questions. - Regarding performance, will this be efficient, even with large windows and many ids / values / timestamps? - My goal after all this is to store it in Cassandra and/or use the

Re: Feature Generation for Large datasets composed of many time series

2017-07-23 Thread Lukasz Cwik
You can do this efficiently with Apache Beam, but you would need to write code which converts a user's expression into a set of PTransforms, or create a few pipeline variants for commonly computed outcomes. There are already many transforms which can compute things like min, max, average. Take a look
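
As a rough illustration of why per-id min, max, and average scale well: each can be kept as a small accumulator that is updated one value at a time and merged with other partial accumulators in any order, so the work can be split across ids and across chunks of the same id. The sketch below is plain Python, not the Beam API; `Stats`, `aggregate_per_id`, and `merge_partials` are hypothetical names used only for illustration, and records are assumed to be `(id, value)` pairs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Stats:
    """Mergeable accumulator for count, sum, min, and max (hypothetical helper)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> "Stats":
        # Fold a single value into the accumulator.
        return Stats(self.count + 1, self.total + value,
                     min(self.minimum, value), max(self.maximum, value))

    def merge(self, other: "Stats") -> "Stats":
        # Combine two partial accumulators; the order of merging does not matter.
        return Stats(self.count + other.count, self.total + other.total,
                     min(self.minimum, other.minimum),
                     max(self.maximum, other.maximum))

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def aggregate_per_id(records: Iterable[Tuple[str, float]]) -> Dict[str, Stats]:
    """Build one accumulator per series id from (id, value) records."""
    out: Dict[str, Stats] = {}
    for series_id, value in records:
        out[series_id] = out.get(series_id, Stats()).add(value)
    return out


def merge_partials(left: Dict[str, Stats],
                   right: Dict[str, Stats]) -> Dict[str, Stats]:
    """Merge per-id accumulators computed independently on two chunks."""
    merged = dict(left)
    for series_id, stats in right.items():
        merged[series_id] = (merged[series_id].merge(stats)
                             if series_id in merged else stats)
    return merged


if __name__ == "__main__":
    records = [("a", 1.0), ("b", 5.0), ("a", 3.0), ("b", 7.0), ("a", 2.0)]
    # Aggregating two chunks separately and merging matches a single pass.
    partial = merge_partials(aggregate_per_id(records[:2]),
                             aggregate_per_id(records[2:]))
    for series_id, stats in sorted(partial.items()):
        print(series_id, stats.minimum, stats.maximum, stats.mean)
```

Because the accumulators for different ids never interact, more distinct ids means more independent units of work, which is in line with the point about ids and parallelism made elsewhere in this thread.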