Re: Left join
Hi Manu,

It is helpful.

Thanks,
Prabeesh K.

On 16 August 2017 at 08:25, Manu Zhang wrote:

> Hi Prabeesh,
>
> You may checkout https://github.com/apache/beam/blob/master/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java for reference.
>
> Thanks,
> Manu
>
> On Mon, Jul 31, 2017 at 3:40 PM Prabeesh K. wrote:
>
>> pcoll1 = [('key1', [values]), ('key2', [values])]
>>
>> pcoll2 = [('key1', value), ('key3', value)]
>>
>>
>> On 31 July 2017 at 11:21, Prabeesh K. wrote:
>>
>>> Hi,
>>>
>>> help me to improve the left Joiner. Is this the right way to join the
>>> Pcollection in the beam ?
>>>
>>>
>>> pcoll1 = ..
>>> pcoll2 = ..
>>>
>>> left_joined = (
>>>     {'left': pcoll1, 'right': pcoll2}
>>>     | 'LeftJoiner: Combine' >> beam.CoGroupByKey()
>>>     | 'LeftJoiner: ExtractValues' >> beam.Values()
>>>     | 'LeftJoiner: JoinValues' >> beam.ParDo(LeftJoinerFn())
>>> )
>>>
>>> class LeftJoinerFn(beam.DoFn):
>>>
>>>     def __init__(self):
>>>         super(LeftJoinerFn, self).__init__()
>>>
>>>     def process(self, row, **kwargs):
>>>
>>>         left = row['left']
>>>         right = row['right']
>>>
>>>         if left and right:
>>>             for each in left:
>>>                 yield each + right[0]
>>>
>>>         elif left:
>>>             for each in left:
>>>                 yield each
>>>
>>>
>>
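For readers following this thread, the per-key join logic can be sketched in plain Python, independent of Beam. This is only an illustration and not the code from the thread: it assumes each grouped value has the shape {'left': [...], 'right': [...]} (as in the quoted pipeline after the key is dropped), and the function name `left_join` is hypothetical. Unlike the quoted `LeftJoinerFn`, which uses only `right[0]`, this variant pairs each left row with every matching right row and pads unmatched left rows with None.

```python
def left_join(grouped):
    """Yield left-join rows for one key.

    `grouped` is assumed to look like {'left': [...], 'right': [...]}.
    """
    left = list(grouped.get('left', []))
    right = list(grouped.get('right', []))
    for left_row in left:
        if right:
            # Pair the left row with every matching right row.
            for right_row in right:
                yield (left_row, right_row)
        else:
            # No match on the right: keep the left row, pad with None.
            yield (left_row, None)


# Small in-memory checks of the three cases.
rows = list(left_join({'left': ['a', 'b'], 'right': [1]}))
unmatched = list(left_join({'left': ['c'], 'right': []}))
no_left = list(left_join({'left': [], 'right': [2]}))
```

In a pipeline like the one quoted above, a generator of this kind could presumably be applied with `beam.FlatMap(left_join)` in place of the `ParDo` step. Keys that appear only on the right side produce no output, which is what makes this a left join.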
Re: Slack channel
Invitation sent. Welcome.

On Wed, Aug 16, 2017 at 11:41 AM shen yu wrote:

> Hi, I'd like to join the Slack channel for Apache Beam. I work at Klook
> and would like to get involved in the Apache Beam community. My email is
> you...@klook.com
Using external python packages in pipeline
Hi,

I'm using the apache-beam Python SDK for running pipelines. How do I install 3rd party packages and use them in my pipeline? I always get ImportError: No module named ...

Do I have to create a template and specify staging_location, temp_location... in order to use external packages? (Right now I'm using Google Cloud Shell to run pipelines and install packages by running pip install.)

P.S. I'm not sure if this is the right place to ask platform-specific (Google Cloud Dataflow) questions?

Thanks in advance,
youxun
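A likely cause of the ImportError is that `pip install` in Cloud Shell only installs the package on the machine that launches the job, not on the Dataflow workers. The Beam Python SDK accepts a `--requirements_file` option that lists PyPI packages to install on the workers. The sketch below only assembles the command-line flags. The function name, project ID, and bucket name are hypothetical placeholders, and the file `requirements.txt` is assumed to exist next to the pipeline script.

```python
def build_pipeline_args(project, bucket, requirements_file="requirements.txt"):
    """Assemble Beam pipeline flags for a Dataflow run (illustrative only)."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--staging_location=gs://{bucket}/staging",
        f"--temp_location=gs://{bucket}/temp",
        # Packages listed in this file are installed on the workers.
        f"--requirements_file={requirements_file}",
    ]


# "my-project" and "my-bucket" are placeholder values.
args = build_pipeline_args("my-project", "my-bucket")
```

A list like this would typically be passed to `PipelineOptions(args)` when constructing the pipeline. A requirements file covers packages published on PyPI. For local or multi-file dependencies, the SDK also provides the `--extra_package` and `--setup_file` options.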
Slack channel
Hi, I'd like to join the Slack channel for Apache Beam. I work at Klook and would like to get involved in the Apache Beam community. My email is you...@klook.com
Re: Resampling a time series stuck on a GroupByKey
I have invited you to the Slack channel.

2 million data points doesn't seem like it should be an issue. Have you considered trying a simpler combiner like Count to see if the bottleneck is with the combiner that you are supplying?

Also, could you share the code for what resample_function does?

On Mon, Aug 14, 2017 at 2:43 AM, Tristan Marechaux <tristan.marech...@walnut-algo.com> wrote:

> Hi all,
>
> I wrote a Beam pipeline with the Python SDK that resamples a
> time series containing a data point every minute into a 5-minute time series.
>
> My pipeline looks like:
> input_data |
> WindowInto(FixedWindows(size=timedelta(minutes=5).total_seconds()))
> | CombineGlobally(resample_function)
>
> When I run it with the local or Dataflow runner with a small dataset, it
> works and does what I want.
>
> But when I try to run it on the Dataflow runner with a bigger dataset
> (1 700 000 data points timestamped over 15 years), it stays stuck for hours on
> the GroupByKey step of CombineGlobally.
>
> My question is: did I do something wrong with the design of my pipeline?
>
> PS: Can someone invite me to the Slack channel?
> --
>
> Tristan Marechaux
>
> Data Scientist | *Walnut Algorithms*
>
> Mobile : +33 627804399
>
> Email: tristan.marech...@walnut-algo.com
>
> Web: www.walnutalgorithms.com
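To make the intended resampling concrete, here is a plain-Python sketch of what 5-minute fixed windows do to 1-minute data. It is not the poster's `resample_function`, which is not shown in the thread. It assumes the resample is a per-window mean, that timestamps are in seconds, and that windows are aligned to multiples of the window size. All names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the fixed window containing `timestamp`."""
    return timestamp - timestamp % size


def resample_mean(points, size=WINDOW_SECONDS):
    """Average (timestamp, value) pairs per fixed window."""
    # Each accumulator is [running sum, count], so partial results
    # from different chunks of data could be merged by adding them.
    accumulators = defaultdict(lambda: [0.0, 0])
    for timestamp, value in points:
        acc = accumulators[window_start(timestamp, size)]
        acc[0] += value
        acc[1] += 1
    return {
        start: total / count
        for start, (total, count) in sorted(accumulators.items())
    }


# Five one-minute points that fall into two 5-minute windows.
points = [(0, 1.0), (60, 2.0), (120, 3.0), (300, 10.0), (360, 20.0)]
resampled = resample_mean(points)
```

The (sum, count) accumulator is small and mergeable, which is the general shape a combiner needs in order to aggregate incrementally. Whether the combiner in the original pipeline has this property cannot be determined from the thread.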