Re: Regarding Beam Slack Channel

2018-03-13 Thread Lukasz Cwik
Invite sent, welcome.

On Tue, Mar 13, 2018 at 11:08 AM, Ramjee Ganti wrote:
> Hello,
>
> I have been using Apache Beam and Dataflow for the last few months. Can
> someone please add me to the Beam Slack channel?
>
> Thanks
> Ramjee
> http://ramjeeganti.com

Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Guillaume Balaine
Hello Beamers, I have been a Beam advocate for a while now, and am trying to use it for batch jobs as well as streaming jobs. I am trying to prove that it can be as fast as Spark for simple use cases. Currently, I have a Spark job that processes a sum + count over a TB of parquet files that runs

Re: Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Marián Dvorský
Hi Guillaume, You may want to avoid the final join by using CombineFns.compose() instead. Marian On Tue, Mar 13, 2018 at 9:07 PM Guillaume Balaine wrote: >
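A minimal sketch of the idea behind this suggestion, assuming the job computes a sum and a count per key: rather than running two separate aggregations and joining them afterwards, both values are carried in one accumulator and produced in a single pass. This is plain Python, not the Beam API; the function names (add_input, merge_accumulators, combine_per_key, merge_partials) and the sample data are hypothetical and only illustrate the pattern.

```python
from collections import defaultdict

def add_input(acc, value):
    # Fold one value into a (sum, count) accumulator.
    total, count = acc
    return (total + value, count + 1)

def merge_accumulators(a, b):
    # Merge two partial (sum, count) accumulators, e.g. from different workers.
    return (a[0] + b[0], a[1] + b[1])

def combine_per_key(records):
    # Single pass over (key, value) pairs; sum and count are built together,
    # so no join between a "sums" result and a "counts" result is needed.
    accs = defaultdict(lambda: (0, 0))
    for key, value in records:
        accs[key] = add_input(accs[key], value)
    return dict(accs)

def merge_partials(left, right):
    # Combine per-key partial results computed on two shards of the input.
    merged = dict(left)
    for key, acc in right.items():
        merged[key] = merge_accumulators(merged.get(key, (0, 0)), acc)
    return merged

# Hypothetical sample data split across two shards.
shard_a = [("x", 1), ("y", 10), ("x", 2)]
shard_b = [("x", 3), ("z", 5)]
result = merge_partials(combine_per_key(shard_a), combine_per_key(shard_b))
print(result)
```

Because the accumulators merge associatively, each shard can be reduced independently and only the small per-key (sum, count) pairs need to be shuffled.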

Re: Use shared invites for Beam Slack Channel

2018-03-13 Thread Lukasz Cwik
Thanks Dan, I generated the invite URL: https://join.slack.com/t/apachebeam/shared_invite/enQtMzI4ODYzODY3MTY5LTIxOTJmMmFkMGVkMThhYmIwOWRkMTFiOGI3NDdlYzNmMmE2ZTA4N2JiMjc5ZDNmYTgxZGY5OTNlMDljMzM5NDU and opened https://issues.apache.org/jira/browse/BEAM-3846 to update the Apache Beam website so

Regarding Beam Slack Channel

2018-03-13 Thread Ramjee Ganti
Hello, I have been using Apache Beam and Dataflow for the last few months. Can someone please add me to the Beam Slack channel? Thanks Ramjee http://ramjeeganti.com

Re: HDFS data locality and distribution, Flink

2018-03-13 Thread Aljoscha Krettek
Hi, There should be no data-locality awareness with Beam on Flink because there are no APIs in Beam that Flink could use to schedule tasks with awareness. It seems it just happens that the readers are distributed as they are. Are the files roughly of equal size? Best, Aljoscha > On 12. Mar

Getting Ready for the Apache Community Summit @ San Francisco, CA

2018-03-13 Thread Griselda Cuevas
Hi Everyone, As you might remember from this thread [1] we're hosting the first Apache Beam Community Summit in San Francisco tomorrow. I've prepared a notes document [2] so that people can read after the sessions. Additionally, folks who cannot attend can add questions starting now so we can

Use shared invites for Beam Slack Channel

2018-03-13 Thread Dan Halperin
Slack now has a publicly-endorsed way to post a public shared invite: https://my.slack.com/admin/shared_invites The downside is that it apparently has to be renewed monthly. Maybe Beam should use that, instead? Tradeoffs are not obvious, but it seems a win: * forget to renew -> people can't