Re: Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Marián Dvorský
Hi Guillaume,

You may want to avoid the final join by using CombineFns.compose() instead.

Marian

On Tue, Mar 13, 2018 at 9:07 PM Guillaume Balaine wrote:
> Hello Beamers,
>
> I
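To illustrate the idea behind this suggestion, here is a minimal plain-Python sketch. It does not use Beam or the CombineFns.compose() API itself. It assumes the job computes a sum and a count per key as two separate aggregations and then joins them, which the thread snippet does not spell out. The function names sum_and_count_per_key and merge_accumulators are hypothetical, chosen only for this example. The point is that a single accumulator holding both values per key makes the final join unnecessary.

```python
from collections import defaultdict


def sum_and_count_per_key(records):
    """Single pass over (key, value) pairs, keeping one [sum, count]
    accumulator per key so both aggregates come out together."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in records:
        entry = acc[key]
        entry[0] += value  # running sum for this key
        entry[1] += 1      # running count for this key
    return {key: (s, c) for key, (s, c) in acc.items()}


def merge_accumulators(left, right):
    """Merge partial results from two partitions, the way a combiner
    would merge per-worker accumulators."""
    merged = dict(left)
    for key, (s, c) in right.items():
        prev_sum, prev_count = merged.get(key, (0, 0))
        merged[key] = (prev_sum + s, prev_count + c)
    return merged
```

Each partition can be reduced independently with sum_and_count_per_key, and the partial results combined with merge_accumulators. A composed combine function in Beam would play a similar role, but its exact usage should be checked against the Beam documentation.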

Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Guillaume Balaine
Hello Beamers,

I have been a Beam advocate for a while now, and I am trying to use it for batch jobs as well as streaming jobs. I am trying to prove that it can be as fast as Spark for simple use cases. Currently, I have a Spark job that computes a sum + count over a TB of Parquet files that runs i