Re: Left join

2017-08-15 Thread Prabeesh K.
Hi Manu,

It is helpful.

Thanks,
Prabeesh K.

On 16 August 2017 at 08:25, Manu Zhang  wrote:

> Hi Prabeesh,
>
> You may checkout https://github.com/apache/beam/blob/master/sdks/
> java/extensions/join-library/src/main/java/org/apache/beam/
> sdk/extensions/joinlibrary/Join.java for reference.
>
> Thanks,
> Manu
>
> On Mon, Jul 31, 2017 at 3:40 PM Prabeesh K.  wrote:
>
>> pcoll1 = [('key1', [values]), ('key2', [values])]
>>
>> pcoll2 = [('key1', value), ('key3', value)]
>>
>>
>> On 31 July 2017 at 11:21, Prabeesh K.  wrote:
>>
>>> Hi,
>>>
>>> help me to improve the left Joiner. Is this the right way to join the
>>> Pcollection in the beam ?
>>>
>>>
>>> pcoll1 = ..
>>> pcoll2 = ..
>>>
>>> left_joined = (
>>> {'left': pcoll1, 'right': pcoll2}
>>> | 'LeftJoiner: Combine' >> beam.CoGroupByKey()
>>> | 'LeftJoiner: ExtractValues' >> beam.Values()
>>> | 'LeftJoiner: JoinValues' >> beam.ParDo(LeftJoinerFn())
>>> )
>>>
>>> class LeftJoinerFn(beam.DoFn):
>>>
>>> def __init__(self):
>>> super(LeftJoinerFn, self).__init__()
>>>
>>> def process(self, row, **kwargs):
>>>
>>> left = row['left']
>>> right = row['right']
>>>
>>> if left and right:
>>> for each in left:
>>> yield each + right[0]
>>>
>>> elif left:
>>> for each in left:
>>> yield each
>>>
>>>
>>


Re: Slack channel

2017-08-15 Thread Manu Zhang
Invitation sent. Welcome.

On Wed, Aug 16, 2017 at 11:41 AM shen yu  wrote:

> Hi, I'd like to join the Slack channel for Apache Beam. I work at Klook
> and would like to get involved in the Apache Beam community. My email is
> you...@klook.com
>


Using external python packges in pipeline

2017-08-15 Thread shen yu
Hi, I'm using apache-beam python sdk for running pipelines. How do I
install 3rd party packages and use them in my pipeline? I always get
ImportError: No module named ... Do I have to create a template and specify
staging_location, temp_location... in order to use external packages?
(right now I'm using Google cloud shell to run pipelines and install
packages by running pip install)

p.s. I'm not sure if this is the right place to ask platform-specific
(Google cloud Dataflow) questions?

Thanks in advance

youxun


Slack channel

2017-08-15 Thread shen yu
Hi, I'd like to join the Slack channel for Apache Beam. I work at Klook and
would like to get involved in the Apache Beam community. My email is
you...@klook.com


Re: Resampling a timeserie stuck on a GroupByKey

2017-08-15 Thread Lukasz Cwik
I have invited you to the slack channel.

2 million data points doesn't seem like it should be an issue.
Have you considered trying a simpler combiner like Count to see if the
bottleneck is with the combiner that you are supplying?
Also, could you share the code for what resample_function does?

On Mon, Aug 14, 2017 at 2:43 AM, Tristan Marechaux <
tristan.marech...@walnut-algo.com> wrote:

> Hi all,
>
> I wrote a Beam Pipeline written with the python SDK that resample a
> timeseries containing data points everery minute to a 5-minutes timeserie.
>
> My pipeline looks like:
> input_data | 
> WindowInto(FixedWindows(size=timedelta(minutes=5).total_seconds()))
> | CombineGlobaly(resample_function)
>
> When I run it with the local or DataFlow runner with a small dataset, it
> works and does what I want.
>
> But when I try to run it on the DataFlow runner with a bigger dataset (1
> 700 000 datapoints timestamped over 15 years) it stay stuck for hours on
> the GroupByKey step of CombineGlobaly.
>
> My question is : Did I do something wrong with the design of my pipeline?
>
> PS: Can someone invite me to the slack channel?
> --
>
> Tristan Marechaux
>
> Data Scientist | *Walnut Algorithms*
>
> Mobile : +33 627804399 <+33627804399>
>
> Email: tristan.marech...@walnut-algo.com
>
> Web: www.walnutalgorithms.com
>