Getting Ready for the Apache Community Summit @ San Francisco, CA

2018-03-13 Thread Griselda Cuevas
Hi Everyone,

As you might remember from this thread [1], we're hosting the first Apache
Beam Community Summit in San Francisco tomorrow.

I've prepared a notes document [2] so that people can read up on the
sessions afterwards. Additionally, folks who cannot attend can add questions
starting now, so we can address them during or after the summit.

I'll report back with a summary after the event. Thanks to everyone who is
coming!

Gris Cuevas

[1] https://lists.apache.org/list.html?user@beam.apache.org:lte=2M:summit
[2] https://docs.google.com/document/d/1B4EU8jjZy9TnlRiZWW9hSCqegbmOh8boIgjkAyOfnCk/edit#


Re: Use shared invites for Beam Slack Channel

2018-03-13 Thread Lukasz Cwik
Thanks Dan,

I generated the invite URL:
https://join.slack.com/t/apachebeam/shared_invite/enQtMzI4ODYzODY3MTY5LTIxOTJmMmFkMGVkMThhYmIwOWRkMTFiOGI3NDdlYzNmMmE2ZTA4N2JiMjc5ZDNmYTgxZGY5OTNlMDljMzM5NDU

and opened https://issues.apache.org/jira/browse/BEAM-3846 to update the
Apache Beam website so people can join themselves.


On Tue, Mar 13, 2018 at 12:48 PM Dan Halperin  wrote:

> Slack now has a publicly-endorsed way to post a public shared invite:
> https://my.slack.com/admin/shared_invites
>
> The downside is that it apparently has to be renewed monthly.
>
> Maybe Beam should use that instead? The tradeoffs are not obvious, but it
> seems a win:
>
> * if we forget to renew -> people can't sign up (and reach out to user@?)
> vs
> * now, we always force them to mail user@
>
> Dan
>
> On Tue, Mar 13, 2018 at 11:54 AM, Lukasz Cwik  wrote:
>
>> Invite sent, welcome.
>>
>> On Tue, Mar 13, 2018 at 11:08 AM, Ramjee Ganti 
>> wrote:
>>
>>> Hello,
>>>
>>> I have been using Apache Beam and Dataflow for the last few months. Can
>>> someone please add me to the Beam Slack channel?
>>>
>>> Thanks
>>> Ramjee
>>> http://ramjeeganti.com
>>>
>>
>>
>


Re: Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Marián Dvorský
Hi Guillaume,

You may want to avoid the final join by using CombineFns.compose() instead.
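
A minimal sketch of what that composition could look like, assuming your
keyed input is a PCollection<KV<String, Double>> called keyedValues (the
names here are illustrative, not taken from your pipeline):

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.CombineFns;
import org.apache.beam.sdk.transforms.CombineFns.CoCombineResult;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags identify each partial aggregate in the composed result.
static final TupleTag<Double> SUM_TAG = new TupleTag<Double>() {};
static final TupleTag<Long> COUNT_TAG = new TupleTag<Long>() {};

static PCollection<KV<String, CoCombineResult>> sumAndCount(
    PCollection<KV<String, Double>> keyedValues) {
  // Both combine fns consume the Double value unchanged.
  SimpleFunction<Double, Double> identity =
      new SimpleFunction<Double, Double>() {
        @Override
        public Double apply(Double v) {
          return v;
        }
      };

  // A single combine pass computes sum and count together, so the
  // final join disappears.
  return keyedValues.apply(
      Combine.perKey(
          CombineFns.compose()
              .with(identity, Sum.ofDoubles(), SUM_TAG)
              .with(identity, Count.<Double>combineFn(), COUNT_TAG)));
}

// Downstream, each aggregate is read back out of the CoCombineResult:
//   double sum = coCombineResult.get(SUM_TAG);
//   long count = coCombineResult.get(COUNT_TAG);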

Marian

On Tue, Mar 13, 2018 at 9:07 PM Guillaume Balaine  wrote:

> Hello Beamers,
>
> I have been a Beam advocate for a while now, and am trying to use it for
> batch jobs as well as streaming jobs.
> I am trying to prove that it can be as fast as Spark for simple use cases.
> Currently, I have a Spark job that processes a sum + count over a TB of
> parquet files that runs in roughly 90 min.
> Using the same resources (on EMR or on Mesos) I can't even come close to
> that.
> My Scio/Beam job on Flink sums roughly 20 billion rows of parquet in 3h
> with a parallelism of 144 and 1TB of RAM (although I suspect many operators
> are idle, so I should probably use less parallelism with the same number of
> cores).
> I also implemented an identical version in pure Java, because I am unsure
> whether the Kryo-encoded tuples are properly managed by Flink's memory
> optimizations. I am also testing it on Spark and Apex.
>
> My point is: is there any way to optimize this simple process?
> HadoopFileIO (using parquet and specific Avro coders to improve perf over
> Generic) ->
> Map to KV of (field1 str, field2 str, field3 str) (value double, 1),
> ordered by most discriminating to least -> Combine.perKey(Sum),
> or value and then join Sum and Count with a PCollectionTuple
> -> AvroIO.Write
>
> The equivalent Spark job does a group by key, and then a sum.
>
> Are there some tricks I am missing here?
>
> Thanks in advance for your help.
>


Sum over many keys, over TB of parquet, from HDFS (S3)

2018-03-13 Thread Guillaume Balaine
Hello Beamers,

I have been a Beam advocate for a while now, and am trying to use it for
batch jobs as well as streaming jobs.
I am trying to prove that it can be as fast as Spark for simple use cases.
Currently, I have a Spark job that processes a sum + count over a TB of
parquet files that runs in roughly 90 min.
Using the same resources (on EMR or on Mesos) I can't even come close to
that.
My Scio/Beam job on Flink sums roughly 20 billion rows of parquet in 3h
with a parallelism of 144 and 1TB of RAM (although I suspect many operators
are idle, so I should probably use less parallelism with the same number of
cores).
I also implemented an identical version in pure Java, because I am unsure
whether the Kryo-encoded tuples are properly managed by Flink's memory
optimizations. I am also testing it on Spark and Apex.

My point is: is there any way to optimize this simple process (a rough
sketch follows below)?
HadoopFileIO (using parquet and specific Avro coders to improve perf over
Generic) ->
Map to KV of (field1 str, field2 str, field3 str) (value double, 1),
ordered by most discriminating to least -> Combine.perKey(Sum),
or value and then join Sum and Count with a PCollectionTuple
-> AvroIO.Write
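
A rough sketch of the keyed-sum stage I mean (the Row class and its fields
are placeholders standing in for my actual parquet/Avro schema):

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Placeholder record; the real job uses the Avro-generated class (with a
// registered coder) read from parquet.
class Row implements java.io.Serializable {
  String field1;
  String field2;
  String field3;
  double value;
}

static PCollection<KV<String, Double>> sumPerCompositeKey(PCollection<Row> rows) {
  return rows
      // Composite key, most discriminating field first.
      .apply(
          MapElements.into(
                  TypeDescriptors.kvs(
                      TypeDescriptors.strings(), TypeDescriptors.doubles()))
              .via(r -> KV.of(r.field1 + "|" + r.field2 + "|" + r.field3, r.value)))
      // Combine.perKey(Sum) pre-aggregates on each worker before the shuffle.
      .apply(Sum.doublesPerKey());
}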

The equivalent Spark job does a group by key, and then a sum.

Are there some tricks I am missing here?

Thanks in advance for your help.


Use shared invites for Beam Slack Channel

2018-03-13 Thread Dan Halperin
Slack now has a publicly-endorsed way to post a public shared invite:
https://my.slack.com/admin/shared_invites

The downside is that it apparently has to be renewed monthly.

Maybe Beam should use that instead? The tradeoffs are not obvious, but it
seems a win:

* if we forget to renew -> people can't sign up (and reach out to user@?)
vs
* now, we always force them to mail user@

Dan

On Tue, Mar 13, 2018 at 11:54 AM, Lukasz Cwik  wrote:

> Invite sent, welcome.
>
> On Tue, Mar 13, 2018 at 11:08 AM, Ramjee Ganti 
> wrote:
>
>> Hello,
>>
>> I have been using Apache Beam and Dataflow for the last few months. Can
>> someone please add me to the Beam Slack channel?
>>
>> Thanks
>> Ramjee
>> http://ramjeeganti.com
>>
>
>


Re: Regarding Beam Slack Channel

2018-03-13 Thread Lukasz Cwik
Invite sent, welcome.

On Tue, Mar 13, 2018 at 11:08 AM, Ramjee Ganti 
wrote:

> Hello,
>
> I have been using Apache Beam and Dataflow for the last few months. Can
> someone please add me to the Beam Slack channel?
>
> Thanks
> Ramjee
> http://ramjeeganti.com
>


Regarding Beam Slack Channel

2018-03-13 Thread Ramjee Ganti
Hello,

I have been using Apache Beam and Dataflow for the last few months. Can
someone please add me to the Beam Slack channel?

Thanks
Ramjee
http://ramjeeganti.com


Re: HDFS data locality and distribution, Flink

2018-03-13 Thread Aljoscha Krettek
Hi,

There should be no data-locality awareness with Beam on Flink, because there
are no APIs in Beam that Flink could use to schedule tasks with locality
awareness. It seems the readers just happen to be distributed as they are.
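
If the skew comes purely from where the read splits landed, one thing you
could try is an explicit Reshuffle right after the read; it materializes
the collection and redistributes the elements across workers. A minimal
sketch, assuming your records are already KVs (the types are placeholders):

import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

static <K, V> PCollection<KV<K, V>> rebalance(PCollection<KV<K, V>> records) {
  // Reshuffle breaks the coupling between read-split placement and
  // downstream processing by redistributing elements across workers.
  return records.apply(Reshuffle.<K, V>of());
}

Whether that helps depends on whether the extra shuffle is cheaper than the
spillover you are seeing; neither Beam nor Flink will do this automatically.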

Are the files roughly of equal size?

Best,
Aljoscha

> On 12. Mar 2018, at 05:50, Reinier Kip  wrote:
> 
> Relevant versions: Beam 2.1, Flink 1.3.
> From: Reinier Kip 
> Sent: 12 March 2018 13:46:24
> To: user@beam.apache.org
> Subject: HDFS data locality and distribution, Flink
>  
> Hey all,
> 
> I'm trying to batch-process 30-ish files from HDFS, but I see that data is 
> distributed very badly across slots. 4 out of 32 slots get 4/5ths of the 
> data, another 3 slots get about 1/5th, and a last slot just a few records. 
> This probably triggers disk spillover on these slots and slows down the job 
> immensely. The data has many, many unique keys, and processing could be done 
> in a highly parallel manner. From what I understand, HDFS data locality 
> governs which splits are assigned to which subtask.
> 
> I'm running a Beam on Flink on YARN pipeline.
> I'm reading 30-ish files, whose records are later grouped by their millions 
> of unique keys.
> For now, I have 8 task managers with 4 slots each. Beam sets all subtasks 
> to a parallelism of 32.
> Data seems to be localised to 9 out of the 32 slots, 3 out of the 8 task 
> managers.
> 
> Does the statement about input split assignment ring true? Is the fact that 
> data isn't redistributed an effort by Flink to achieve high data locality, 
> even if this means disk spillover for a few slots/TMs and idleness for 
> others? Is there any use for parallelism if work isn't distributed anyway?
> 
> Thanks for your time, Reinier