The more ids the better, since each id can be processed independently,
which increases the parallelism in your pipeline. Also, the aggregations
you listed (min, max, average) are very efficient to compute, even on
large datasets.
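To illustrate why these aggregations are cheap and parallelize well, here
is a minimal plain-Python sketch of the combiner idea. It is not Beam code:
the names Stats, combine_per_id and merge_partials are made up for this
example, and it only mimics the accumulate-then-merge pattern that
combiners such as Beam's CombineFn are built around. Each id needs only a
small fixed-size accumulator, and partial results from different workers
can be merged in any order.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Stats:
    """Fixed-size accumulator for count, sum, min and max (hypothetical name)."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value):
        # Fold one more value into the accumulator.
        return Stats(self.count + 1, self.total + value,
                     min(self.minimum, value), max(self.maximum, value))

    def merge(self, other):
        # Combine two partial accumulators; order does not matter.
        return Stats(self.count + other.count, self.total + other.total,
                     min(self.minimum, other.minimum),
                     max(self.maximum, other.maximum))

    @property
    def mean(self):
        return self.total / self.count if self.count else math.nan


def combine_per_id(records):
    """Fold (id, value) pairs into one Stats accumulator per id."""
    accumulators = defaultdict(Stats)
    for key, value in records:
        accumulators[key] = accumulators[key].add(value)
    return dict(accumulators)


def merge_partials(left, right):
    """Merge per-id results computed independently on two shards."""
    merged = dict(left)
    for key, stats in right.items():
        merged[key] = merged[key].merge(stats) if key in merged else stats
    return merged
```

Splitting the input into shards, calling combine_per_id on each, and then
merging with merge_partials gives the same per-id result as processing
everything in one pass, which is what lets a runner spread the work over
many machines.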
Cassandra is already supported:
https://github.com/apache/beam/tree/master/sdks/java/io/cassandra
Using a
OK, thanks!
That's exactly the kind of thing I was imagining with Apache Beam.
I still have a few questions.
- Regarding performance: will this be efficient, even with a large
"window" and many ids / values / timestamps ...?
- My goal after all this is to store it in Cassandra and/or use the
You can do this efficiently with Apache Beam, but you would need to write
code which converts a user's expression into a set of PTransforms, or create
a few pipeline variants for commonly computed outcomes. There are already
many transforms which can compute things like min, max, and average. Take a
look