Re: Streaming inserts BQ with Java SDK Beam

2019-05-07 Thread Pablo Estrada
Hi Andres!
You can definitely do streaming inserts using the Java SDK. This is
available with BigQueryIO.write(). Specifically, you can use the
`withMethod`[1] call to specify whether you want batch loads or streaming
inserts. If you specify streaming inserts, Beam should insert rows as they
arrive, bundle by bundle.
Hope that helps
-P.

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withMethod-org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method-

On Tue, May 7, 2019 at 9:58 AM Andres Angel 
wrote:

> Hello everyone,
>
> I need to use BigQuery inserts within my beam pipeline, hence I know well
> the built-in IO options offer `BigQueryIO`, however this will insert in a
> batch fashion to BQ creating underneath a BQ load job. I instead need to
> trigger a streaming insert into BQ, and I was reviewing the Java SDK
> documentation but seems like this is not possible.
>
> On the other hand, looking at the Python SDK I found GitHub
> documentation code where they are using a method *InsertAll*, which
> is apparently what I need. If this is official, I would like to know if
> there is a straightforward way to trigger streaming inserts into BQ using the Java SDK.
>
> thanks so much for your feedback
> AU
>


[ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Pablo Estrada
Hello everyone,
A group of community members has been working on gathering and providing
common pipeline patterns for Beam pipelines. These are examples of how
to perform certain operations, and useful ways of using Beam in your
pipelines. Some of them relate to processing of files, use of side inputs,
state/timers, etc. Check them out[1].

These initial patterns have been chosen based on evidence gathered from
StackOverflow, and from talking to users of Beam.

It would be great if this section could grow, and be useful to many Beam
users. For that reason, we invite anyone to share patterns, and pipeline
examples that they have used in the past. If you are interested in
contributing, please submit a pull request, or get in touch with Cyrus
Maden, Reza Rokni, Melissa Pashniak or myself.

Thanks!
Best
-P.

[1] https://beam.apache.org/documentation/patterns/overview/


Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-16 Thread Pablo Estrada
Hello all,
I'll be having a session where I live-fix a Beam bug for 1 hour next week.
Everyone is invited.

It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I will
finish a full change in that time frame, but we'll see.

I have not yet decided if I will do this via hangouts, or via a youtube
livestream. In any case, I will share the link here in the next few days.

I will most likely work on the Java SDK (I have a little feature request in
mind).

Thanks!
-P.


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-17 Thread Pablo Estrada
Yes! So I will be working on a small feature request for Java's BigQueryIO:
https://issues.apache.org/jira/browse/BEAM-7607

Maybe I'll do something for Python next month. : )
Best
-P.

On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar  wrote:

> +1, I really appreciate this initiative. It would be really helpful
> newbies like me.
>
> Is it possible to list out what are the things that you are planning to
> cover?
>
>
>
>
> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang  wrote:
>
>> Thanks for organizing this Pablo, it'll be very helpful!
>>
>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>> wrote:
>>
>>> Hello all,
>>> I'll be having a session where I live-fix a Beam bug for 1 hour next
>>> week. Everyone is invited.
>>>
>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I will
>>> finish a full change in that time frame, but we'll see.
>>>
>>> I have not yet decided if I will do this via hangouts, or via a youtube
>>> livestream. In any case, I will share the link here in the next few days.
>>>
>>> I will most likely work on the Java SDK (I have a little feature request
>>> in mind).
>>>
>>> Thanks!
>>> -P.
>>>
>>


Re: [python SDK] Returning Pub/Sub message_id and timestamp

2019-07-19 Thread Pablo Estrada
Beam 2.14.0 will include support for writing files in the fileio module
(the support will include GCS, local files, HDFS). It will also support
streaming. The transform is still marked as experimental, and is likely to
receive improvements - but you can check it out for your pipelines, and see
if it helps you : )
Best
-P.
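
For reference, here is a minimal sketch of what a streaming file write with
fileio can look like (an editor's illustration, not an official snippet: the
subscription, bucket and window size are hypothetical, and the WriteToFiles
arguments may differ slightly across SDK versions):

====
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadMessages' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/my-sub')
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     # WriteToFiles needs windowed input when the pipeline is streaming.
     | 'Window' >> beam.WindowInto(FixedWindows(60))
     | 'WriteFiles' >> fileio.WriteToFiles(path='gs://my-bucket/output/'))
====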

On Fri, Jul 19, 2019 at 12:24 PM Valentyn Tymofieiev 
wrote:

> As of today, Beam Python streaming does not support writing to GCS yet,
> which  explains
> https://stackoverflow.com/questions/54745869/how-to-create-a-dataflow-pipeline-from-pub-sub-to-gcs-in-python
>  .
>
> You are right - id_label and timestamp_attribute do not work on the Direct
> runner yet as per https://issues.apache.org/jira/browse/BEAM-4275, I
> checked with a few folks and that seems to be the current status, but you
> can still give them a try on Dataflow runner.
>
> You may also find the following examples helpful:
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py
> (streaming pipeline).
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/bigquery_tornadoes.py
>  (batch
> pipeline)
>
> https://stackoverflow.com/questions/46854167/dataflow-streaming-using-python-sdk-transform-for-pubsub-messages-to-bigquery-o
> .
>
> On Mon, Jul 15, 2019 at 2:12 AM Matthew Darwin <
> matthew.dar...@carfinance247.co.uk> wrote:
>
>> Hi Valentyn,
>>
>> Thank you for your reply. I'm already using the with_attributes=True
>> option, however this returns the attributes property of the JSON, i.e :-
>>
>> {
>>>   "attributes": {
>>> "source": "python"
>>>   }
>>>
>>
>>
>> My pipeline currently looks like this (id_label is commented out when
>> running directly, as it causes a not implemented error):-
>>
>> messages = (p
>> | 'Read From Pub Sub' >> ReadFromPubSub(subscription
>> =known_args.input_subscription,with_attributes=True)
>> #,id_label='message_id')
>> | 'Parse JSON' >> beam.Map(format_message_element)
>>
>> My function to parse the message looks like this:-
>>
>> def format_message_element(message, timestamp=beam.DoFn.TimestampParam):
>> messagedict = json.loads(message.data)
>> rownumber = messagedict['rownumber']
>> fullmessage = {'data' : json.dumps(message.data),
>> 'rownumber' : int(rownumber),
>> 'attributes' : json.dumps(message.attributes),
>> 'timestamp' : float(timestamp)}
>>
>> logging.info(message.attributes)
>> logging.info(message)
>>
>> return (rownumber, fullmessage)
>>
>> I'm aware there are the id_label and with_timestamp parameters for the
>> ReadFromPubSub method, however, these don't seem to work with the direct
>> runner, as per
>> https://issues.apache.org/jira/browse/BEAM-4275?jql=text%20~%20%22python%20id_label%22
>> which makes testing somewhat difficult.
>>
>> My full code is attached; when running above 2.9.0 of the SDK I can't get
>> past the windowing function, due to an issue that appears related to this
>> https://stackoverflow.com/questions/54745869/how-to-create-a-dataflow-pipeline-from-pub-sub-to-gcs-in-python
>> and this
>> https://stackoverflow.com/questions/55109403/apache-beam-python-sdk-upgrade-issue
>> as I was receiving the following error when running on 2.12.0:
>>
>> Cannot convert GlobalWindow to
>> apache_beam.utils.windowed_value._IntervalWindowBase [while running
>> 'generatedPtransform-150']
>>
>> On 2.9.0 when running on the local runner, I receive the following output
>> from the logging.info calls in format_message_element:
>>
>> INFO:root:{u'source': u'python'}
>> INFO:root:PubsubMessage({"rownumber": 1}, {u'source': u'python'})
>>
>> I was expecting the messageId and publishTime as part of the object
>> returned; but as you can see there's nothing there for those attributes.
>>
>> (The code does not quite map correctly to the BigQuery table so it fails
>> inserts at that point, which I'm currently trying to resolve!)
>>
>> Kind regards,
>>
>> Matthew
>>
>>
>>
>> On Fri, 2019-07-12 at 09:13 -0700, Valentyn Tymofieiev wrote:
>>
>> Hi Matthew,
>>
>> Welcome to Beam!
>>
>> Looking at Python PubSub IO API, you should be able to access id and
>> timestamp by setting `with_attributes=True` when using `ReadFromPubSub`
>> PTransform, see [1,2].
>>
>> [1]
>> https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/sdks/python/apache_beam/io/gcp/pubsub.py#L61
>> [2]
>> https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/sdks/python/apache_beam/io/gcp/pubsub.py#L138
>>
>> On Fri, Jul 12, 2019 at 1:36 AM Matthew Darwin <
>> matthew.dar...@carfinance247.co.uk> wrote:
>>
>> Good morning,
>>
>> I'm very new to Beam, and pretty new to Python so please first accept my
>> apologies for any obvious misconceptions/mistakes in the following.
>>
>> I am currently trying to develop a sample pipeline in Python to pull
>> messages from Pub/Sub and then write 

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-20 Thread Pablo Estrada
Hello all,

This will be streamed on youtube on this link:
https://www.youtube.com/watch?v=xpIpEO4PUDo

I think there will be a live chat, so I will hopefully be available to
answer questions. To be honest, my workflow is not super efficient, but...
oh well, hopefully it will be at least somewhat helpful to others : )
Best
-P.

On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:

> +1, I'd love to see this as a recording. Will you stick it up on youtube
> afterwards?
>
> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
> wrote:
>
>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be recorded
>> as well.
>>
>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada  wrote:
>>
>>> Yes! So I will be working on a small feature request for Java's
>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>
>>> Maybe I'll do something for Python next month. : )
>>> Best
>>> -P.
>>>
>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>>> wrote:
>>>
>>>> +1, I really appreciate this initiative. It would be really helpful
>>>> newbies like me.
>>>>
>>>> Is it possible to list out what are the things that you are planning to
>>>> cover?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang  wrote:
>>>>
>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>
>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour next
>>>>>> week. Everyone is invited.
>>>>>>
>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I
>>>>>> will finish a full change in that time frame, but we'll see.
>>>>>>
>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>> youtube livestream. In any case, I will share the link here in the next 
>>>>>> few
>>>>>> days.
>>>>>>
>>>>>> I will most likely work on the Java SDK (I have a little feature
>>>>>> request in mind).
>>>>>>
>>>>>> Thanks!
>>>>>> -P.
>>>>>>
>>>>>


Choosing a coder for a class that contains a Row?

2019-07-22 Thread Pablo Estrada
Hello all,
I am writing a utility to push data to PubSub. My data class looks
something like so:
==
class MyData {
  String someId;
  Row someRow;
  Row someOtherRow;
}
==
The schema for the Rows is not known a-priori. It is contained by the Row.
I am then pushing this data to pubsub:
===
MyData pushingData = 
WhatCoder? coder = 

ByteArrayOutputStream os = new ByteArrayOutputStream();
coder.encode(this, os);

pubsubClient.connect();
pubsubClient.push(PubSubMessage.newBuilder().setData(os.toByteArray()).build());
pubsubClient.close();
=
What's the right coder to use in this case? I don't know if SchemaCoder
will work, because it seems that it requires the Row's schema a priori. I
have not been able to make AvroCoder work.

Any tips?
Best
-P.


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-23 Thread Pablo Estrada
Hi all,
I've just realized that https://issues.apache.org/jira/browse/BEAM-7607 is
a single-line change - and we'd spend 40 minutes chitchatting, so I'll also
be working on https://jira.apache.org/jira/browse/BEAM-7803, which is a
Python issue (also for the BigQuery sink!).
Thanks!
-P.

On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada  wrote:

> Hello all,
>
> This will be streamed on youtube on this link:
> https://www.youtube.com/watch?v=xpIpEO4PUDo
>
> I think there will be a live chat, so I will hopefully be available to
> answer questions. To be honest, my workflow is not super efficient, but...
> oh well, hopefully it will be at least somewhat helpful to others : )
> Best
> -P.
>
> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>
>> +1, I'd love to see this as a recording. Will you stick it up on youtube
>> afterwards?
>>
>> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
>> wrote:
>>
>>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be
>>> recorded as well.
>>>
>>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Yes! So I will be working on a small feature request for Java's
>>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>>
>>>> Maybe I'll do something for Python next month. : )
>>>> Best
>>>> -P.
>>>>
>>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>>>> wrote:
>>>>
>>>>> +1, I really appreciate this initiative. It would be really helpful
>>>>> newbies like me.
>>>>>
>>>>> Is it possible to list out what are the things that you are planning
>>>>> to cover?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>>>> wrote:
>>>>>
>>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>>
>>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour next
>>>>>>> week. Everyone is invited.
>>>>>>>
>>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I
>>>>>>> will finish a full change in that time frame, but we'll see.
>>>>>>>
>>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>>> youtube livestream. In any case, I will share the link here in the next 
>>>>>>> few
>>>>>>> days.
>>>>>>>
>>>>>>> I will most likely work on the Java SDK (I have a little feature
>>>>>>> request in mind).
>>>>>>>
>>>>>>> Thanks!
>>>>>>> -P.
>>>>>>>
>>>>>>


Re: Choosing a coder for a class that contains a Row?

2019-07-23 Thread Pablo Estrada
+dev 
Thanks Ryan! This is quite helpful. Still not what I need : ) - but useful.

The data is change data capture from databases, and I'm putting it into a
Beam Row. The schema for the Row is generally homogeneous, but subject to
change at some point in the future if the schema in the database changes.
It's unusual and unlikely, but possible. I have no idea how Beam deals with
evolving schemas. +Reuven Lax  is there documentation /
examples / anything around this? : )

I think evolving schemas is an interesting question

For now, I am going to Java-serialize the objects, and delay figuring this
out. But I reckon I'll have to come back to this...

Best
-P.

On Tue, Jul 23, 2019 at 1:07 AM Ryan Skraba  wrote:

> Hello Pablo!  Just to clarify -- the Row schemas aren't known at
> pipeline construction time, but can be discovered from the instance of
> MyData?
>
> Once discovered, is the schema "homogeneous" for all instance of
> MyData?  (i.e. someRow will always have the same schema for all
> instances afterwards, and there won't be another someRow with a
> different schema).
>
> We've encountered a parallel "problem" with pure Avro data, where the
> instance is a GenericRecord containing its own Avro schema but
> *without* knowing the schema until the pipeline is run.  The solution
> that we've been using is a bit hacky, but we're using an ad hoc
> per-job schema registry and a custom coder where each worker saves the
> schema in the `encode` before writing the record, and loads it lazily
> in the `decode` before reading.
>
> The original code is available[1] (be gentle, it was written with Beam
> 0.4.0-incubating... and has continued to work until now).
>
> In practice, the ad hoc schema registry is just a server socket in the
> Spark driver, in-memory for DirectRunner / local mode, and a
> read/write to a known location in other runners.  There are definitely
> other solutions with side-inputs and providers, and the job server in
> portability looks like an exciting candidate for per-job schema
> registry story...
>
> I'm super eager to see if there are other ideas or a contribution we
> can make in this area that's "Beam Row" oriented!
>
> Ryan
>
> [1]
> https://github.com/Talend/components/blob/master/core/components-adapter-beam/src/main/java/org/talend/components/adapter/beam/coders/LazyAvroCoder.java
>
> On Tue, Jul 23, 2019 at 12:49 AM Pablo Estrada  wrote:
> >
> > Hello all,
> > I am writing a utility to push data to PubSub. My data class looks
> something like so:
> > ==
> > class MyData {
> >   String someId;
> >   Row someRow;
> >   Row someOtherRow;
> > }
> > ==
> > The schema for the Rows is not known a-priori. It is contained by the
> Row. I am then pushing this data to pubsub:
> > ===
> > MyData pushingData = 
> > WhatCoder? coder = 
> >
> > ByteArrayOutputStream os = new ByteArrayOutputStream();
> > coder.encode(this, os);
> >
> > pubsubClient.connect();
> >
> pubsubClient.push(PubSubMessage.newBuilder().setData(os.toByteArray()).build());
> > pubsubClient.close();
> > =
> > What's the right coder to use in this case? I don't know if SchemaCoder
> will work, because it seems that it requires the Row's schema a priori. I
> have not been able to make AvroCoder work.
> >
> > Any tips?
> > Best
> > -P.
>


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo
This is still happening.

On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack 
wrote:

> Did I miss the link or this was postponed?
>
> On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Pablo,
>>
>> Assigned  https://issues.apache.org/jira/browse/BEAM-7607 to you, to
>> make even more likely that it is still around on the 25th :-)
>>
>> Cheers,
>> Austin
>>
>> On Tue, Jul 23, 2019 at 11:24 AM Pablo Estrada 
>> wrote:
>>
>>> Hi all,
>>> I've just realized that https://issues.apache.org/jira/browse/BEAM-7607 is
>>> a single-line change - and we'd spend 40 minutes chitchatting, so I'll also
>>> be working on https://jira.apache.org/jira/browse/BEAM-7803, which is a
>>> Python issue (also for the BigQuery sink!).
>>> Thanks!
>>> -P.
>>>
>>> On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> This will be streamed on youtube on this link:
>>>> https://www.youtube.com/watch?v=xpIpEO4PUDo
>>>>
>>>> I think there will be a live chat, so I will hopefully be available to
>>>> answer questions. To be honest, my workflow is not super efficient, but...
>>>> oh well, hopefully it will be at least somewhat helpful to others : )
>>>> Best
>>>> -P.
>>>>
>>>> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>>>>
>>>>> +1, I'd love to see this as a recording. Will you stick it up on
>>>>> youtube afterwards?
>>>>>
>>>>> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
>>>>> wrote:
>>>>>
>>>>>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be
>>>>>> recorded as well.
>>>>>>
>>>>>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
>>>>>> wrote:
>>>>>>
>>>>>>> Yes! So I will be working on a small feature request for Java's
>>>>>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>>>>>
>>>>>>> Maybe I'll do something for Python next month. : )
>>>>>>> Best
>>>>>>> -P.
>>>>>>>
>>>>>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1, I really appreciate this initiative. It would be really helpful
>>>>>>>> newbies like me.
>>>>>>>>
>>>>>>>> Is it possible to list out what are the things that you are
>>>>>>>> planning to cover?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>>>>>
>>>>>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour
>>>>>>>>>> next week. Everyone is invited.
>>>>>>>>>>
>>>>>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I
>>>>>>>>>> will finish a full change in that time frame, but we'll see.
>>>>>>>>>>
>>>>>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>>>>>> youtube livestream. In any case, I will share the link here in the 
>>>>>>>>>> next few
>>>>>>>>>> days.
>>>>>>>>>>
>>>>>>>>>> I will most likely work on the Java SDK (I have a little feature
>>>>>>>>>> request in mind).
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> -P.
>>>>>>>>>>
>>>>>>>>>
>
> --
>
> *DJIOFACK INNOCENT*
> *"Be better than the day before!" -*
> *+1 404 751 8024*
>


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
Thanks to those who tuned in : ) - I feel like I might have spent too long
fiddling with Python code, and not long enough doing setup, testing, etc. I
will try to do another one where I just test / set up the environment / run
lint checks etc.

Here are links for:
Setting up the Python environment: https://youtu.be/xpIpEO4PUDo?t=334
Quickly setting up the Java environment: https://youtu.be/xpIpEO4PUDo?t=3659

Doing a Pull Request: https://youtu.be/xpIpEO4PUDo?t=3770

On Thu, Jul 25, 2019 at 4:39 PM sridhar inuog 
wrote:

> Thanks, Pablo for organizing this session. I found it useful.
>
> On Thu, Jul 25, 2019 at 4:56 PM Pablo Estrada  wrote:
>
>> The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo
>> This is still happening.
>>
>> On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack 
>> wrote:
>>
>>> Did I miss the link or this was postponed?
>>>
>>> On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
>>>> Pablo,
>>>>
>>>> Assigned  https://issues.apache.org/jira/browse/BEAM-7607 to you, to
>>>> make even more likely that it is still around on the 25th :-)
>>>>
>>>> Cheers,
>>>> Austin
>>>>
>>>> On Tue, Jul 23, 2019 at 11:24 AM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> I've just realized that
>>>>> https://issues.apache.org/jira/browse/BEAM-7607 is a single-line
>>>>> change - and we'd spend 40 minutes chitchatting, so I'll also be working 
>>>>> on
>>>>> https://jira.apache.org/jira/browse/BEAM-7803, which is a Python
>>>>> issue (also for the BigQuery sink!).
>>>>> Thanks!
>>>>> -P.
>>>>>
>>>>> On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> This will be streamed on youtube on this link:
>>>>>> https://www.youtube.com/watch?v=xpIpEO4PUDo
>>>>>>
>>>>>> I think there will be a live chat, so I will hopefully be available
>>>>>> to answer questions. To be honest, my workflow is not super efficient,
>>>>>> but... oh well, hopefully it will be at least somewhat helpful to others 
>>>>>> : )
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>>>>>>
>>>>>>> +1, I'd love to see this as a recording. Will you stick it up on
>>>>>>> youtube afterwards?
>>>>>>>
>>>>>>> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog <
>>>>>>> sridharin...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be
>>>>>>>> recorded as well.
>>>>>>>>
>>>>>>>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes! So I will be working on a small feature request for Java's
>>>>>>>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>>>>>>>
>>>>>>>>> Maybe I'll do something for Python next month. : )
>>>>>>>>> Best
>>>>>>>>> -P.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar <
>>>>>>>>> rakeshku...@lyft.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1, I really appreciate this initiative. It would be really
>>>>>>>>>> helpful newbies like me.
>>>>>>>>>>
>>>>>>>>>> Is it possible to list out what are the things that you are
>>>>>>>>>> planning to cover?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada <
>>>>>>>>>>> pabl...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello all,
>>>>>>>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour
>>>>>>>>>>>> next week. Everyone is invited.
>>>>>>>>>>>>
>>>>>>>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully
>>>>>>>>>>>> I will finish a full change in that time frame, but we'll see.
>>>>>>>>>>>>
>>>>>>>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>>>>>>>> youtube livestream. In any case, I will share the link here in the 
>>>>>>>>>>>> next few
>>>>>>>>>>>> days.
>>>>>>>>>>>>
>>>>>>>>>>>> I will most likely work on the Java SDK (I have a little
>>>>>>>>>>>> feature request in mind).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> -P.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>>
>>> *DJIOFACK INNOCENT*
>>> *"Be better than the day before!" -*
>>> *+1 404 751 8024*
>>>
>>


[Java] Accessing state from FinishBundle method

2019-07-29 Thread Pablo Estrada
Hello all,
I am working on a pipeline where I'd like to write a value to state at the
end of processing a bundle. As it turns out, I don't think this is
possible, as FinishBundleContext does not provide a method for it; and
doing something like so also errors out:

==
 @FinishBundle
  public void finishBundle(
  FinishBundleContext c,
  @StateId("myState") MapState myState) {

 myState.put(this.myClassVarKey, this.myClassVarValue);
  }
==

Maybe this is not yet implemented, and remains to be? Or maybe there's some
other way of doing it? Or maybe this is an antipattern? (though I doubt
that last one).
Thanks!
-P.


Re: Is there any job board for Beam with Java?

2019-09-05 Thread Pablo Estrada
Hi deepak,
this doesn't count as spamming, so don't feel bad for posting the question
: ). Your question is an appropriate one for this mailing list.

That being said, I do not know of a Beam job board. You can probably find
jobs with Beam or Apache Beam as a keyword on LinkedIn. Have you tried that?

Best
-P.

On Thu, Sep 5, 2019 at 7:20 AM deepak kumar  wrote:

> Hi All,
> Sorry to spam.
> I am writing to check if there is any job board where we can post or look
> for open positions/  jobs in Apache Beam?
>
> Thanks
> Deepak
>


Re: Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

2019-09-05 Thread Pablo Estrada
Hi Zdenko,
sorry about the confusion. The reason behind this is that we have not yet
fully switched the batch behavior of WriteToBigQuery, so to have
WriteToBigQuery use BigQueryBatchFileLoads as its implementation, you need
to pass 'use_beam_bq_sink' as an experiment to activate it.
As you rightly figured out, you can use BigQueryBatchFileLoads directly.
Best
-P.
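
To illustrate (a hedged sketch with hypothetical project, bucket and table
names; only the experiment flag itself comes from the discussion above), the
experiment can be passed through the pipeline options, or as
--experiments=use_beam_bq_sink on the command line:

====
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 'use_beam_bq_sink' makes WriteToBigQuery delegate to BigQueryBatchFileLoads.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    temp_location='gs://my-bucket/tmp',
    experiments=['use_beam_bq_sink'])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([{'name': 'a', 'value': 1}])
     | beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',
         schema='name:STRING,value:INTEGER',
         method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
         additional_bq_parameters={'maxBadRecords': 10}))
====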

On Thu, Sep 5, 2019 at 6:06 AM Zdenko Hrcek  wrote:

> Thanks for the code sample,
>
> when I switched to use bigquery_file_loads.BigQueryBatchFileLoads instead
> of bigquery.WriteToBigQuery it works ok now. Not sure why with
> WriteToBigQuery doesn't work, since it's using BigQueryBatchFileLoads under
> the hood...
>
> Thanks for the help.
> Zdenko
> ___
>  http://www.the-swamp.info
>
>
>
> On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath 
> wrote:
>
>> +Pablo Estrada  who added this.
>>
>> I don't think we have tested this specific option but I believe
>> additional BQ parameters option was added in a generic way to accept all
>> additional parameters.
>>
>> Looking at the code, seems like additional parameters do get passed
>> through to load jobs:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>>
>> One thing you can try out is trying to run a BQ load job directly with
>> the same set of data and options to see if the data gets loaded.
>>
>> Thanks,
>> Cham
>>
>> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek  wrote:
>>
>>> Greetings,
>>>
>>> I am using Beam 2.15 and Python 2.7.
>>> I am doing a batch job to load data from CSV and upload to BigQuery. I
>>> like functionality that instead of streaming to BigQuery I can use "file
>>> load", to load table all at once.
>>>
>>> For my case, there are few "bad" records in the input (it's geo data and
>>> during manual upload, BigQuery doesn't accept those as valid geography
>>> records. this is easily solved by setting the number of max bad records.
>>> If I understand correctly, WriteToBigQuery supports
>>> "additional_bq_parameters", but for some reason when running a pipeline on
>>> Dataflow runner it looks like those settings are ignored.
>>>
>>> I played with an example from the documentation
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>>  with
>>> gist file
>>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>>> where the table should be created on the partitioned field and
>>> clustered, but when running on Dataflow it doesn't happen.
>>> When I run on DirectRunner it works as expected. interestingly, when I
>>> add maxBadRecords parameter to additional_bq_parameters, DirectRunner
>>> complains that it doesn't recognize that option.
>>>
>>> This is the first time using this setup/combination so I'm just
>>> wondering if I overlooked something. I would appreciate any help.
>>>
>>> Best regards,
>>> Zdenko
>>>
>>>
>>> ___
>>>  http://www.the-swamp.info
>>>
>>>


Beam meetup Seattle!! September 26th, 6pm

2019-09-23 Thread Pablo Estrada
Hello everyone!

If you are in the Seattle area please come to Beam meetup this Thursday,
September 26th - at 6pm in the Google office in Fremont. There will be
interesting talks, and there should be a number of Beam contributors and
users around. Also pizza and drinks.

The page with all the info:
https://www.meetup.com/Seattle-Apache-Beam-Meetup/events/263845364/

I hope to see a lot of you around, and will be happy to chat about all
things Batch, Streaming, Beam, end of summer, etc.

Best,
-P.


Re: beam.io.BigQuerySource does not accept value providers

2019-10-18 Thread Pablo Estrada
Hi Theodore!
Display data is what's throwing the error, but even setting that issue
aside, BigQuerySource does not support value providers, because it's a
Dataflow-native source. Unfortunately, this is not currently possible.
As a workaround, you could execute a BQ export job (using a DoFn) and
use fileio to consume the newly exported files. We may prioritize
building a source for that, but it is not there at the moment.
Best
-P.
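
As a rough illustration of that workaround (an editor's sketch: the option
names, table and bucket are hypothetical, and the google-cloud-bigquery calls
should be checked against the client version in use), a DoFn can run the
export job at execution time - where calling .get() on a value provider is
allowed - and the exported files can be read downstream:

====
import json
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText


class ExportTableToGcs(beam.DoFn):
    """Runs a BigQuery extract job and emits the resulting file pattern."""

    def __init__(self, table, gcs_prefix):
        # Both arguments are ValueProviders, resolved at runtime with .get().
        self._table = table            # e.g. 'my-project.my_dataset.my_table'
        self._gcs_prefix = gcs_prefix  # e.g. 'gs://my-bucket/exports'

    def process(self, unused_element):
        from google.cloud import bigquery  # import here so it runs on workers
        client = bigquery.Client()
        pattern = self._gcs_prefix.get() + '/part-*.json'
        job_config = bigquery.ExtractJobConfig(
            destination_format='NEWLINE_DELIMITED_JSON')
        client.extract_table(
            self._table.get(), pattern, job_config=job_config).result()
        yield pattern


def build_read(p, table_vp, prefix_vp):
    return (p
            | 'Impulse' >> beam.Create([None])
            | 'Export' >> beam.ParDo(ExportTableToGcs(table_vp, prefix_vp))
            | 'ReadExported' >> ReadAllFromText()
            | 'Parse' >> beam.Map(json.loads))
====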

On Fri, Oct 18, 2019 at 6:42 AM Theodore Siu  wrote:

> Additionally, for reference
> https://stackoverflow.com/questions/46595149/dynamic-bigquery-query-in-dataflow-template
>
> On Fri, Oct 18, 2019 at 9:34 AM Theodore Siu  wrote:
>
>> Hi,
>>
>> We are attempting to build a Dataflow template in Beam Python and are
>> running into issues with using a value provider, specifically
>> with beam.io.BigQuerySource, which throws the following error: ValueError:
>> Invalid DisplayDataItem. Value RuntimeValueProvider(option: input, type:
>> str, default_value: 'test') is of an unsupported type.
>>
>> Tracing the code in Github it seems like the main culprits are the
>> following:
>>
>>
>> https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/io/gcp/bigquery.py#L470
>>
>>
>> https://github.com/apache/beam/blob/d9add564c1c34065829f712074bdd3856b2b0982/sdks/python/apache_beam/transforms/display.py#L244
>>
>>
>> Do we have any idea when a fix can be made?
>>
>> -Theo
>>
>>
>>
>>


Re: real real-time beam

2019-11-25 Thread Pablo Estrada
If I understand correctly - your pipeline has some kind of windowing, and
on every trigger downstream of the combiner, the pipeline updates a cache
with a single, non-windowed value. Is that correct?

What are your keys for this pipeline? You could work this out with, as you
noted, a timer that fires periodically, and keeps some state with the value
that you want to update to the cache.

Is this a Python or Java pipeline? What is the runner?
Best
-P.
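
For what it's worth, here is a hedged Python sketch of the compare-and-swap
idea raised in the quoted question below. The cache class is a stand-in for
whatever external store is used; in practice the read-compare-write should be
atomic on the store's side (a transaction or a conditional write):

====
import apache_beam as beam


class FakeCache(object):
    """Stand-in for an external cache (Redis, a database, etc.)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns (event_time, value) or None.
        return self._data.get(key)

    def put(self, key, timestamped_value):
        self._data[key] = timestamped_value


class ConditionalCacheWrite(beam.DoFn):
    """Only writes a triggered value if it is newer than what is stored."""

    def setup(self):
        self._cache = FakeCache()

    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        key, value = element
        stored = self._cache.get(key)
        if stored is None or stored[0] < timestamp:
            self._cache.put(key, (timestamp, value))
====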

On Mon, Nov 25, 2019 at 1:27 PM Aaron Dixon  wrote:

> Suppose I trigger a Combine per-element (in a high-volume stream) and use
> a ParDo as a sink.
>
> I assume there is no guarantee about the order that my ParDo will see
> these triggers, especially as it processes in parallel, anyway.
>
> That said, my sink writes to a db or cache and I would not like the cache
> to ever regress its value to something "before" what it has already written.
>
> Is the best way to solve this problem to always write the event-time in
> the cache and do a compare-and-swap only updating the sink if the triggered
> value in-hand is later than the target value?
>
> Or is there a better way to guarantee that my ParDo sink will process
> elements in-order? (Eg, if I can give up per-event/real-time, then a
> delay-based trigger would probably be sufficient I imagine.)
>
> Thanks for advice!
>


Re: How to store&read offset with kafkaio

2019-12-04 Thread Pablo Estrada
Hi!
What do you mean by offset? Is 'offset' a field in a database table? Or
maybe it's an offset in the database binlog?
Best
-P.

On Wed, Nov 27, 2019 at 7:32 PM 郑 洁锋  wrote:

> Hi,
>I want to store the offset in Oracle/Mysql, and then every time I
> start the real-time streaming task, I want this task to start consuming
> from the offset stored in Oracle/Mysql.
>I haven't seen how to manually store the offset and specify the
> offset to start consuming after reading the documentation.
>
>Can anyone help me
> --
> zjfpla...@hotmail.com
>


Re: Need Help | SpannerIO

2019-12-18 Thread Pablo Estrada
Or perhaps you have a PCollection<String> or something like that, and you
want to use those strings to issue queries to Spanner?

PCollection<String> myStrings = p.apply(...);

PCollection<Struct> rows = myStrings.apply(
SpannerIO.read()
.withInstanceId(instanceId)
.withDatabaseId(dbId)
.withQuery("SELECT id, name, email FROM users WHERE column = %input%"));


Something like that perhaps?


If that's the case, it looks like you can use SpannerIO.readAll(),
where the input PCollection contains ReadOperations. Something like
this:


myStrings
  .apply(MapElements.into(TypeDescriptor.of(ReadOperation.class))
      .via(myString -> ReadOperation.create().withQuery(...)))
  .apply(SpannerIO.readAll()
      .withInstanceId(instanceId)
      .withDatabaseId(dbId));


You'd have to convert your upstream PCollection into a PCollection of
ReadOperations, and then pass that to SpannerIO.readAll().


On Wed, Dec 18, 2019 at 8:53 AM Luke Cwik  wrote:

> How do you want to use the previous data in the SpannerIO.read()?
>
> Are you trying to perform a join on a key between two PCollections? If so,
> please use CoGroupByKey[1].
> Are you trying to merge two PCollection objects? If so, please use
> Flatten[2].
>
> 1: https://beam.apache.org/documentation/programming-guide/#cogroupbykey
> 2: https://beam.apache.org/documentation/programming-guide/#flatten
>
> On Wed, Dec 18, 2019 at 8:44 AM Ajit Soman 
> wrote:
>
>> Hi,
>>
>> I am creating a pipeline . I want to execute Spanner query once I got
>> data from its previous stage.
>>
>> In java docs, they have given reference for this code.
>>
>> PCollection rows = pipeline.apply(
>> SpannerIO.read()
>> .withInstanceId(instanceId)
>> .withDatabaseId(dbId)
>> .withQuery("SELECT id, name, email FROM users"));
>>
>>
>> *In the above code they have applied the SpannerIO read to the pipeline
>> object. But I want to apply it in the PCollection.apply() method so that I
>> can use the previous stage's output in my query.*
>>
>> PCollection rows = (PCollection)pCollection.apply(
>> SpannerIO.read()
>> .withInstanceId(instanceId)
>> .withDatabaseId(dbId)
>> .withQuery("SELECT id, name, email FROM users"));
>>
>>
>> Need your help.
>>
>> Thanks & Regards,
>> Ajit
>>
>>
>>
>> --
>> DISCLAIMER: This email message along with any attachments may contain
>> information that is confidential or privileged. If you are not the intended
>> recipient or responsible for delivering any of this transmission to an
>> intended recipient, you are hereby notified that any dissemination,
>> distribution, retention, copying or other use of this message or its
>> attachments is prohibited. If you received this message in error, please
>> notify the sender immediately and permanently delete all copies of this
>> message and attachments. No representation is made that this email is
>> free of viruses. Virus scanning is recommended and is the responsibility of
>> the recipient. Thank you.
>
>


[FYI] Rephrasing the 'lull'/processing stuck logs

2020-01-09 Thread Pablo Estrada
Hello Beam users and community,

The Beam Python SDK and the Java workers have a utility that prints
a log message whenever an execution thread goes without a state
transition for over five minutes.

These messages are common in two scenarios:
1. A deadlock happening in the worker (very uncommon, but possible)
2. An operation simply takes over 5 minutes (e.g. a slow RPC, waiting for
an external event, etc).

The old phrasing of these logs has often been a bit confusing, and
led users to think that the pipeline was actually stuck, when the
reality was more harmless: an operation was just slow.

I am introducing a change[1] for the Apache Beam SDK to rephrase these
logs, and make them less confusing.

If you ever used these logs for your debugging, know that the string will
change, but the logs will remain : ).
If you didn't know about these, now you do, and hopefully they will be
useful to you! : )

Thanks!
-P.

[1] https://github.com/apache/beam/pull/10446/files


Re: Beam Summit North America 2019 - recordings

2020-01-13 Thread Pablo Estrada
Thanks Matthias!

On Sun, Jan 12, 2020 at 7:51 AM Matthias Baetens 
wrote:

> Hi everyone,
>
> It's our pleasure to share the recordings from the Beam Summit North
> America 2019.
> Please find them in the YouTube playlist on the Apache Beam channel.
>
> Best regards,
> Matthias
>


Re: [ANNOUNCE] Beam 2.18.0 Released

2020-01-28 Thread Pablo Estrada
Thanks Udi!

On Tue, Jan 28, 2020 at 11:08 AM Rui Wang  wrote:

> Thank you Udi for taking care of Beam 2.18.0 release!
>
>
>
> -Rui
>
> On Tue, Jan 28, 2020 at 10:59 AM Udi Meiri  wrote:
>
>> The Apache Beam team is pleased to announce the release of version 2.18.0.
>>
>> Apache Beam is an open source unified programming model to define and
>> execute data processing pipelines, including ETL, batch and stream
>> (continuous) processing. See https://beam.apache.org
>>
>> You can download the release here:
>>
>> https://beam.apache.org/get-started/downloads/
>>
>> This release includes bug fixes, features, and improvements detailed on
>> the Beam blog: https://beam.apache.org/blog/2020/01/13/beam-2.18.0.html
>>
>> Thanks to everyone who contributed to this release, and we hope you enjoy
>> using Beam 2.18.0.
>> -- Udi Meiri, on behalf of The Apache Beam team
>>
>


Re: 2nd Apache Beam Meetup in Warsaw

2020-01-30 Thread Pablo Estrada
Adding user@ :)

On Thu, Jan 30, 2020, 5:05 AM Michał Walenia 
wrote:

> Hi there,
> we're organizing the second Apache Beam Meetup in Warsaw! It's going to
> take place on 20th February in our office. We're looking for speakers, so
> if anyone is interested in sharing some knowledge - let me know and I'll
> pass it on to the people in charge of organization :)
>
> here's the Meetup link:
> https://www.meetup.com/pl-PL/Warsaw-Apache-Beam-Meetup/events/268293892/
>
> Have a good day and see you in Warsaw!
> Michal
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Pablo Estrada
Hi Alan,
could it be that you're picking up the new Apache Beam 2.19.0 release?
Could you try depending on beam 2.18.0 to see if the issue surfaces when
using the new release?

If something was working and no longer works, it sounds like a bug. This
may have to do with how we pickle (dill / cloudpickle) - see this question
https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
Best
-P.

On Tue, Feb 4, 2020 at 6:22 AM Alan Krumholz 
wrote:

> Hi,
>
> I was running a dataflow job in GCP last night and it was running fine.
> This morning this same exact job is failing with the following error:
>
> Error message from worker: Traceback (most recent call last): File
> "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py",
> line 286, in loads return dill.loads(s) File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
> return load(file, ignore, **kwds) File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
> return Unpickler(file, ignore=ignore, **kwds).load() File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
> obj = StockUnpickler.load(self) File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in
> _load_type return _reverse_typemap[name] KeyError: 'ClassType' During
> handling of the above exception, another exception occurred: Traceback
> (most recent call last): File
> "/usr/local/lib/python3.5/site-packages/dataflow_worker/batchworker.py",
> line 648, in do_work work_executor.execute() File
> "/usr/local/lib/python3.5/site-packages/dataflow_worker/executor.py", line
> 176, in execute op.start() File "apache_beam/runners/worker/operations.py",
> line 649, in apache_beam.runners.worker.operations.DoOperation.start File
> "apache_beam/runners/worker/operations.py", line 651, in
> apache_beam.runners.worker.operations.DoOperation.start File
> "apache_beam/runners/worker/operations.py", line 652, in
> apache_beam.runners.worker.operations.DoOperation.start File
> "apache_beam/runners/worker/operations.py", line 261, in
> apache_beam.runners.worker.operations.Operation.start File
> "apache_beam/runners/worker/operations.py", line 266, in
> apache_beam.runners.worker.operations.Operation.start File
> "apache_beam/runners/worker/operations.py", line 597, in
> apache_beam.runners.worker.operations.DoOperation.setup File
> "apache_beam/runners/worker/operations.py", line 602, in
> apache_beam.runners.worker.operations.DoOperation.setup File
> "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py",
> line 290, in loads return dill.loads(s) File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
> return load(file, ignore, **kwds) File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
> return Unpickler(file, ignore=ignore, **kwds).load() File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
> obj = StockUnpickler.load(self) File
> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in
> _load_type return _reverse_typemap[name] KeyError: 'ClassType'
>
>
> If I use a local runner it still runs fine.
> Anyone else experiencing something similar today? (or know how to fix
> this?)
>
> Thanks!
>


Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Pablo Estrada
Hm that's odd. No changes to the pipeline? Are you able to share some of
the code?

+Udi Meiri  do you have any idea what could be going on
here?

On Tue, Feb 4, 2020 at 9:25 AM Alan Krumholz 
wrote:

> Hi Pablo,
> This is strange... it doesn't seem to be the latest Beam release, as last
> night it was already using 2.19.0. I wonder if it was some release from the
> Dataflow team (not Beam related):
> Job type: Batch
> Job status: Succeeded
> SDK version: Apache Beam Python 3.5 SDK 2.19.0
> Region: us-central1
> Start time: February 3, 2020 at 9:28:35 PM GMT-8
> Elapsed time: 5 min 11 sec
>
> On Tue, Feb 4, 2020 at 9:15 AM Pablo Estrada  wrote:
>
>> Hi Alan,
>> could it be that you're picking up the new Apache Beam 2.19.0 release?
>> Could you try depending on beam 2.18.0 to see if the issue surfaces when
>> using the new release?
>>
>> If something was working and no longer works, it sounds like a bug. This
>> may have to do with how we pickle (dill / cloudpickle) - see this question
>> https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype
>> Best
>> -P.
>>
>> On Tue, Feb 4, 2020 at 6:22 AM Alan Krumholz 
>> wrote:
>>
>>> Hi,
>>>
>>> I was running a dataflow job in GCP last night and it was running fine.
>>> This morning this same exact job is failing with the following error:
>>>
>>> Error message from worker: Traceback (most recent call last): File
>>> "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py",
>>> line 286, in loads return dill.loads(s) File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
>>> return load(file, ignore, **kwds) File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
>>> return Unpickler(file, ignore=ignore, **kwds).load() File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
>>> obj = StockUnpickler.load(self) File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in
>>> _load_type return _reverse_typemap[name] KeyError: 'ClassType' During
>>> handling of the above exception, another exception occurred: Traceback
>>> (most recent call last): File
>>> "/usr/local/lib/python3.5/site-packages/dataflow_worker/batchworker.py",
>>> line 648, in do_work work_executor.execute() File
>>> "/usr/local/lib/python3.5/site-packages/dataflow_worker/executor.py", line
>>> 176, in execute op.start() File "apache_beam/runners/worker/operations.py",
>>> line 649, in apache_beam.runners.worker.operations.DoOperation.start File
>>> "apache_beam/runners/worker/operations.py", line 651, in
>>> apache_beam.runners.worker.operations.DoOperation.start File
>>> "apache_beam/runners/worker/operations.py", line 652, in
>>> apache_beam.runners.worker.operations.DoOperation.start File
>>> "apache_beam/runners/worker/operations.py", line 261, in
>>> apache_beam.runners.worker.operations.Operation.start File
>>> "apache_beam/runners/worker/operations.py", line 266, in
>>> apache_beam.runners.worker.operations.Operation.start File
>>> "apache_beam/runners/worker/operations.py", line 597, in
>>> apache_beam.runners.worker.operations.DoOperation.setup File
>>> "apache_beam/runners/worker/operations.py", line 602, in
>>> apache_beam.runners.worker.operations.DoOperation.setup File
>>> "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py",
>>> line 290, in loads return dill.loads(s) File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 275, in loads
>>> return load(file, ignore, **kwds) File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 270, in load
>>> return Unpickler(file, ignore=ignore, **kwds).load() File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 472, in load
>>> obj = StockUnpickler.load(self) File
>>> "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 577, in
>>> _load_type return _reverse_typemap[name] KeyError: 'ClassType'
>>>
>>>
>>> If I use a local runner it still runs fine.
>>> Anyone else experiencing something similar today? (or know how to fix
>>> this?)
>>>
>>> Thanks!
>>>
>>


Re: Big Query source and Dataflow template doubt

2020-03-13 Thread Pablo Estrada
Hello Andre!
Unfortunately, the BigQuerySource does not support value providers.
There is a different transform to read from BigQuery,
apache_beam.io.gcp.bigquery._ReadFromBigQuery. That one will soon support
value providers (hopefully by 2.21).
At the moment, however, it is not possible to change the query
depending on the value providers.
Best
-P.

On Fri, Mar 13, 2020 at 12:46 PM André Rocha Silva <
a.si...@portaltelemedicina.com.br> wrote:

> Hello everybody
>
> I have a query that I am trying to use in dataflow:
>
> ***
>  'QueryTable' >> beam.io.Read(
> beam.io.BigQuerySource(
> dataset='dataset',
> project='project',
> query=f''' SELECT * FROM dataset.table WHERE version =
> '{user_options.version}' '''))
> ***
> When I run it from my terminal with Dataflow runner, it goes as it should
> be. No problem at all.
>
> But when I try to launch it from template the query does not get the
> option. That appears in the step info.
>
> [image: image.png]
>
> I have tried to change from {user_options.version}
> to {user_options.version.get()}. But got this error:
> apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option:
> version, type: str, default_value: None).get() not called from a runtime
> context
>
> Does someone know how to correct this?
> Locally, I am using:
> SDK version
> Apache Beam Python 3.7 SDK 2.17.0
>
> Thank you a lot!
>
> --
>
>*ANDRÉ ROCHA SILVA*
>   * DATA ENGINEER*
>   (48) 3181-0611
>
>    /andre-rocha-silva/
> 
> 
> 
> 
>
>


Re: Beam Katas YouTube

2020-03-27 Thread Pablo Estrada
Nice : D

On Fri, Mar 27, 2020 at 12:44 AM Alex Van Boxel  wrote:

> That's nicely done! Congrats, going to share this immediately.
>
> And I actually didn't know where the name Beam came from, now I know :-)
>
>  _/
> _/ Alex Van Boxel
>
>
> On Fri, Mar 27, 2020 at 4:32 AM Henry Suryawirawan <
> hsuryawira...@google.com> wrote:
>
>> Hello,
>>
>> Just would like to share that recently the Apache Beam Katas is featured
>> in the Google Cloud Level Up YouTube video (
>> https://www.youtube.com/watch?v=tzZiC4s_OqE).
>> Hope that it can be useful for the community.
>>
>>
>> Regards,
>> Henry
>>
>>
>>


Re: Try Beam Katas Today

2020-05-12 Thread Pablo Estrada
Sharing Damon's email with the user@ list as well. Thanks Damon!

On Tue, May 12, 2020 at 9:02 PM Damon Douglas 
wrote:

> Hello Everyone,
>
> If you don't already know, there are helpful instructional tools for
> learning the Apache Beam SDKs called Beam Katas hosted on
> https://stepik.org.  Similar to traditional Kata
> , they are meant to be repeated as
> practice.  Before practicing the katas myself, I found myself copy/pasting
> code (Please accept my confession 😎 ).  Now I find myself actually
> composing pipelines.  Just like kata forms, you find them becoming part of
> you.  If you are interested, below are listed the current available katas:
>
> 1.  Java - https://stepik.org/course/54530
>
> 2.  Python -  https://stepik.org/course/54532
>
> 3.  Go (in development) - https://stepik.org/course/70387
>
> If you are absolutely brand new to Beam and it scares you like it scared
> me, come talk to me.
>
> Best,
>
> Damon
>


Re: PaneInfo showing UNKOWN State

2020-05-26 Thread Pablo Estrada
Hi Jayadeep,
Unfortunately, it seems that PaneInfo is not well supported yet on the
local runners: https://issues.apache.org/jira/browse/BEAM-3759

Can you share more about your use case, and what you'd like to do with the
PaneInfo?

On Sat, May 23, 2020 at 10:03 AM Jay  wrote:

> Hi All -
>
> Below is a sample code written in Python which reads data from Pub/Sub and
> tries to determine the PaneInfo for different elements
>
> There are 3 rows of data as shown below
>
> {"country":"USA","user_id": 1,"ts": 0},
>> {"country":"USA","user_id": 2,"ts": 0},
>> {"country":"USA","user_id": 3,"ts": 8}
>
>
> Below is the sample code which modifies the event timestamp for the
> messages
>
>> class AddTimestampDoFn(beam.DoFn):
>>   def process(self, element):
>> unix_timestamp = (datetime.datetime.now()).timestamp() + element["ts"]
>> yield beam.window.TimestampedValue(element, unix_timestamp)
>
>
> Below is a Helper class to retrieve the timestamp. This has no importance
> apart from checking to see if the timestamp has been set correctly
>
>> class AddTimestamp(beam.DoFn):
>>   def process(self, element, timestamp=beam.DoFn.TimestampParam):
>> yield (timestamp.to_utc_datetime(), element)
>
>
> Code below reads the records from Pub/Sub and runs the ParDo's mentioned
> above
>
> data = p | "read"  >>
>> beam.io.ReadFromPubSub(subscription=subscription)
>>  | "JsonConvert"   >> beam.Map(json.loads)
>
> sliding_windows = (
>>   data | 'ConvertIntoUserEvents' >> beam.ParDo(AddTimestampDoFn())
>>| 'AddTimestamp'  >> beam.ParDo(AddTimestamp())
>> )
>
>
> Below is my trigger definition and the implementation
>
>>
>>
>> class ProcessRecord(beam.DoFn):
>>   def process(self, element,
>> window=beam.DoFn.WindowParam,pane_info=beam.DoFn.PaneInfoParam):
>>  # access pane info e.g pane_info.is_first, pane_info.is_last,
>> pane_info.timing
>>  yield (element,
>> datetime.datetime.now(),window.start.to_utc_datetime(),
>> window.end.to_utc_datetime(), pane_info.timing, pane_info.is_first,
>> pane_info.is_last)
>
>
>
>> window_fn  = beam.window.FixedWindows(10)
>> trigger_fn = beam.transforms.trigger.AfterWatermark(early=AfterCount(1))
>> acc_dis_fn = beam.transforms.trigger.AccumulationMode.ACCUMULATING
>> new_final = sliding_windows | "acc_30" >> beam.WindowInto(
>> window_fn,
>> trigger= trigger_fn,
>> accumulation_mode=acc_dis_fn
>> ) | "acc_30_par" >> beam.ParDo(ProcessRecord())
>
>
>
>  When I look at the output I see the below
>
> 0 (datetime.datetime(2020, 5, 23, 16, 44, 9, 598681), {'country': 'USA',
> 'user_id': 1, 'ts': 0}) 2020-05-23 16:44:45.895890 2020-05-23 16:44:00 
> 2020-05-23
> 16:44:10 3 True True
> 1 (datetime.datetime(2020, 5, 23, 16, 44, 9, 995521), {'country': 'USA',
> 'user_id': 2, 'ts': 0}) 2020-05-23 16:44:46.297163 2020-05-23 16:44:00 
> 2020-05-23
> 16:44:10 3 True True
> 2 (datetime.datetime(2020, 5, 23, 16, 44, 17, 995603), {'country': 'USA',
> 'user_id': 3, 'ts': 8}) 2020-05-23 16:44:46.297259 2020-05-23 16:44:10 
> 2020-05-23
> 16:44:20 3 True True
> As can be observed above there are two Windows that have been defined
> which is inline with the data and the FixedWindow strategy
> Window 1 - 2020-05-23 16:44:00, 2020-05-23 16:44:10
> Window 2 - 2020-05-23 16:44:10, 2020-05-23 16:44:20
>
> Couple of questions which I am not able to understand
> 1. Why is the PaneInfo.Timing value returned as "3" (UNKNOWN) instead of
> (EARLY, ON_TIME) ?
> 2. Shouldn't for Window1 and Window 2 there should be two firings one
> EARLY and one ON_TIME ?
> 3. The last two boolean values are is_first and is_last again both have
> been set to TRUE which doesn't look right.
>
>
> Can someone suggest on what can be the issue ?
>
> Thanks,
> Jayadeep
>


Re: ValueProviderOptions and templates

2020-06-17 Thread Pablo Estrada
I believe you don't need to provide it at template construction time, but
at invocation time. Are you having trouble with providing the parameters at
invocation time?
Best
-P.
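
For reference, here is a minimal sketch of the usual wiring for template
parameters (an editor's illustration; --year and --key mirror the hypothetical
options from the question below). They are declared with
add_value_provider_argument, and .get() is only called at execution time,
e.g. inside a DoFn:

====
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProviders let the template be built without concrete values;
        # the values arrive when the template is launched.
        parser.add_value_provider_argument('--year', type=str)
        parser.add_value_provider_argument('--key', type=str)


class UseRuntimeParams(beam.DoFn):
    def __init__(self, year, key):
        self._year = year  # ValueProvider
        self._key = key    # ValueProvider

    def process(self, element):
        # .get() is only valid at execution time, not at template creation.
        yield (self._key.get(), self._year.get(), element)


options = PipelineOptions()  # --runner, --template_location, etc. from argv
my_opts = options.view_as(MyTemplateOptions)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['some-element'])
     | beam.ParDo(UseRuntimeParams(my_opts.year, my_opts.key)))
====

The concrete values are then supplied when the template is invoked (for
example through the Dataflow console or gcloud dataflow jobs run ...
--parameters year=...,key=...), rather than when the template file is
generated.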

On Tue, Jun 16, 2020 at 2:22 PM Marco Mistroni  wrote:

> HI all
>  i am creating dataflow jobs using python API by creating templates which
> i then run on gcp.
> So suppose my dataflow job accepts 2 input parameter which i need to
> supply at invocation time.
> Do i need to specify these parameters when i create my template?
> Here' s a sample. suppose i need two parameters
> --year   and  --key
> Which of the two is the correct syntax for creating a template for the job?
>
> python -m my_main  --runner=dataflow --project=xxx-projects
>  --template_location=gs://mm_dataflow_bucket/templates/mytemplate
> --temp_location=gs://mm_dataflow_bucket/temp
> --staging_location=gs://mm_dataflow_bucket/staging  --setup ./setup.py
>
> OR
> python -m my_main  --runner=dataflow --project=xxx-projects
>  --template_location=gs://mm_dataflow_bucket/templates/mytemplate
> --temp_location=gs://mm_dataflow_bucket/temp
> --staging_location=gs://mm_dataflow_bucket/staging  --setup ./setup.py
> --year  --key
>
> My hope with the latter is that the template 'sees' the option and replace
> it with the correct values when i actually invoke the template
> regards
>  Marco
>


Re: Provide credentials for s3 writes

2020-09-29 Thread Pablo Estrada
Hi Ross,
it seems that this feature is missing (e.g. passing a pipeline option with
authentication information for AWS). I'm sorry about that - that's pretty
annoying.
I wonder if you can use the setup.py file to add the default configuration
yourself until we have proper support for pipeline option-based
authentication. Could you try adding this default config in setup.py?
Best
-P.

On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
ross.vandegr...@cleardata.com> wrote:

> Hello all,
>
> I have a python pipeline that writes data to an s3 bucket.  On my laptop it
> picks up the SDK credentials from my boto3 config and works great.
>
> Is it possible to provide credentials explicitly? I'd like to use remote
> dataflow runners, which won't have implicit AWS credentials available.
>
> Thanks,
> Ross
>


Re: [ANNOUNCE] Beam 2.25.0 Released

2020-10-26 Thread Pablo Estrada
Thanks Robin!

On Mon, Oct 26, 2020 at 11:06 AM Robin Qiu  wrote:

> The Apache Beam team is pleased to announce the release of version 2.25.0.
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See: https://beam.apache.org
>
> You can download the release here:
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed on
> the Beam blog: https://beam.apache.org/blog/beam-2.25.0/
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.25.0.
>


Re: October 2020, Beam Community Update

2020-11-03 Thread Pablo Estrada
Hi Alexey,
Do you have any other place in mind? I don't think Brittany has current
plans to publish this elsewhere, but if you have any good ideas, I imagine
she could consider them : )
Best
-P.

On Tue, Nov 3, 2020 at 8:23 AM Alexey Romanenko 
wrote:

> Thanks for doing this!
>
> Is it going to be published somewhere else than only the mailing lists?
>
> On 31 Oct 2020, at 00:03, Brittany Hermann  wrote:
>
> Hi everyone,
>
> Attached is the October 2020 Beam Community Update. The purpose of this
> newsletter is purely community focused - giving folks visibility into the
> health and growth of the community. This newsletter will
> not cover project releases or feature updates. Please let me know if you
> have any questions; feedback is welcome!
>
>
> https://docs.google.com/document/d/1_t6xKoOQVwgn2edmRVh1ViudmbnNM3BwZyguKAwwjfA/edit?usp=sharing
>
>
>
>
>
>


Re: [REMOTE WORKSHOPS] Introduction to Apache Beam - remote workshops Dec 3rd and Dec 10th

2020-11-17 Thread Pablo Estrada
+dev  so everyone will know.
This is cool. Thanks Karolina! Will these be an introduction to basic Beam
concepts?
Thanks!
-P.

On Mon, Nov 16, 2020 at 11:52 AM Karolina Rosół 
wrote:

> Hello everyone,
>
> You may not know me but I'm Karolina Rosół, Head of Cloud & OSS at Polidea
> and I'm working with great Apache Beam committers Michał Walenia & Kamil
> Wasilewski, who will be running introductory remote workshops on
> Apache Beam on *Dec 3rd* and *Dec 10th*.
>
> If you're interested in taking part in the workshop, feel free to have a
> look at the Warsaw Beam Meetup page
> or enroll directly -> bit.ly/BeamWorkshops
>
> Thanks,
>
> Karolina Rosół
> Polidea  | Head of Cloud & OSS
>
> M: +48 606 630 236 <+48606630236>
> E: karolina.ro...@polidea.com
>
> Check out our projects! 
>


Re: [ANNOUNCEMENT] Support for Hadoop 3 confirmed

2020-11-18 Thread Pablo Estrada
Very nice. Thanks Piotr!

On Wed, Nov 18, 2020 at 7:42 AM Alexey Romanenko 
wrote:

> This is very good improvement, thank you for working on this!
>
> PS: What about the Hive/HBase modules?
>
> > On 18 Nov 2020, at 11:54, Piotr Szuberski 
> wrote:
> >
> > Starting with [1] all of the Beam modules except for Hive/HBase are
> confirmed to support Hadoop 3.
> >
> > The change comes with 2.27.0 version.
> >
> > [1] https://github.com/apache/beam/pull/13352
>
>


Re: Unit Testing Custom Coder

2020-11-19 Thread Pablo Estrada
Hi Dave!
I don't have a lot of experience with coders, but I would include the Beam
user@ list (added just now) to see if someone else has done this.
Best
-P.

On Wed, Nov 18, 2020 at 7:22 AM Dave Anderson  wrote:

> Pablo,
>
> Also, for now I've created tests that exercise the encode() and decode()
> methods directly on the Coder. I was just curious if there was a way to
> also test it in the context of a pipeline.
>
> thanks,
>
> Dave
>
> *David Anderson*
>
> Solution Engineer - Cloud Healthcare & Life Sciences
>
> drander...@google.com
>
>
>
> On Wed, Nov 18, 2020 at 9:17 AM Dave Anderson 
> wrote:
>
>> Hi Pablo,
>>
>> I was wondering if you could point me to any resources that might help me
>> address a challenge I've run into when unit testing a Custom Coder I have
>> created as part of the Whistle ELP project.
>>
>> The regular tests where I'm using PAssert to check the value returning
>> from a call to
>> p.apply( "Create from test objects",
>>  Create.of(runtimeContext).withCoder(RuntimeContextCoder.of()));
>>
>> are working fine. The issue arises when I throw an exception from the
>> coder, for example when a null object is passed in to the encode() or
>> decode() methods.
>>
>> I've tried to use Assert.assertThrows() for the test case, and it is
>> actually succeeding in that check, but the pipeline continues to run and I
>> end up with a NPE.
>>
>> Outputs for non-root node Create from test objects are null
>> java.lang.NullPointerException: Outputs for non-root node Create from
>> test objects are null
>>
>> Have you ever written unit tests using the TestPipeline that are testing
>> for exceptions being thrown? Am I missing something really obvious?
>>
>> Also, feel free to direct me to a mailing list if there's an appropriate
>> one for this. You have just been incredibly helpful with the FhirIO,
>> HL7v2IO, and the DicomIO efforts, so I figured I'd ping you to get some
>> direction.
>>
>> thanks,
>>
>> Dave
>>
>>
>> *David Anderson*
>>
>> Solution Engineer - Cloud Healthcare & Life Sciences
>>
>> drander...@google.com
>>
>>
>>


Re: FileIO Azure Storage problems

2020-11-20 Thread Pablo Estrada
Hi Thomas!
This looks like it may be a bug with the azfs implementation. If you notice
the code, you're hitting this issue when the byte channel needs to seek
backwards.
I may take a stab at fixing it. I believe we have to catch a mark
expiration, and just reopen the file if that happens.
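To make that concrete, the general shape of such a fix could look roughly like
the sketch below. This is only an illustration of the idea, not the actual azfs
code; openBlobInputStream() is a hypothetical helper that opens a fresh stream
to the same blob.

// Sketch: if resetting to the mark fails because the mark has expired,
// reopen the blob and skip forward to the position we wanted to seek back to.
private InputStream seekBackwards(InputStream in, long targetPosition) throws IOException {
  try {
    in.reset(); // throws IOException when the stream mark has expired
    return in;
  } catch (IOException markExpired) {
    InputStream reopened = openBlobInputStream(); // hypothetical: open the blob again
    long skipped = 0;
    while (skipped < targetPosition) {
      long n = reopened.skip(targetPosition - skipped);
      if (n <= 0) {
        throw new IOException("Could not skip to position " + targetPosition, markExpired);
      }
      skipped += n;
    }
    return reopened;
  }
}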

Do you think you could create a JIRA issue to track it and assign it to me?

On Thu, Nov 19, 2020 at 10:07 PM Thomas Li Fredriksen <
thomas.fredrik...@cognite.com> wrote:

> Good morning everyone.
>
>
> I am attempting to parse a very large CSV (65 million lines) with BEAM
> (version 2.25) from an Azure Blob and have created a pipeline for this. I
> am running the pipeline on dataflow and testing with a smaller version of
> the file (10'000 lines).
>
> I am using FileIO and the filesystem prefix "azfs" to read from azure
> blobs.
>
> The pipeline works with the small test file, but when I run this on the
> bigger file I am getting an exception "Stream Mark Expired" (pasted below).
> Reading the same file from a GCP bucket works just fine, including when
> running with dataflow.
>
> Our primary filestore is Azure Storage, so moving to GCP buckets is not an
> option.
>
> Is there anyone who can help me resolve this?
>
>
> Best Regards
> Thomas Li Fredriksen
>
>
> The exception referenced above:
>
>
> Error message from worker: java.lang.RuntimeException:
> org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException:
> Stream mark expired.
> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
>
> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
>
> org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowReshuffleFn.processElement(BatchGroupAlsoByWindowReshuffleFn.java:56)
>
> org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowReshuffleFn.processElement(BatchGroupAlsoByWindowReshuffleFn.java:39)
>
> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
>
> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
>
> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:114)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>
> java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.beam.sdk.util.UserCodeException:
> java.lang.RuntimeException: Stream mark expired.
> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
> org.apache.beam.sdk.io.ReadAllViaFileBasedSource$ReadFileRangesFn$DoFnInvoker.invokeProcessElement(Unknown
> Source)
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:227)
>
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:186)
>
> org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
>
> org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:280)
>
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.S

Re: FileIO Azure Storage problems

2020-11-20 Thread Pablo Estrada
Follow up: What is the size of the file you're consuming?
-P.

On Fri, Nov 20, 2020 at 8:40 AM Pablo Estrada  wrote:

> Hi Thomas!
> This looks like it may be a bug with the azfs implementation. If you
> notice the code, you're hitting this issue when the byte channel needs to
> seek backwards.
> I may take a stab at fixing it. I believe we have to catch a mark
> expiration, and just reopen the file if that happens.
>
> Do you think you could create a JIRA issue to track it and assign it to me?
>
> On Thu, Nov 19, 2020 at 10:07 PM Thomas Li Fredriksen <
> thomas.fredrik...@cognite.com> wrote:
>
>> Good morning everyone.
>>
>>
>> I am attempting to parse a very large CSV (65 million lines) with BEAM
>> (version 2.25) from an Azure Blob and have created a pipeline for this. I
>> am running the pipeline on dataflow and testing with a smaller version of
>> the file (10'000 lines).
>>
>> I am using FileIO and the filesystem prefix "azfs" to read from azure
>> blobs.
>>
>> The pipeline works with the small test file, but when I run this on the
>> bigger file I am getting an exception "Stream Mark Expired" (pasted below).
>> Reading the same file from a GCP bucket works just fine, including when
>> running with dataflow.
>>
>> Our primary filestore is Azure Storage, so moving to GCP buckets is not
>> an option.
>>
>> Is there anyone who can help me resolve this?
>>
>>
>> Best Regards
>> Thomas Li Fredriksen
>>
>>
>> The exception referenced above:
>>
>>
>> Error message from worker: java.lang.RuntimeException:
>> org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException:
>> Stream mark expired.
>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
>>
>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
>>
>> org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowReshuffleFn.processElement(BatchGroupAlsoByWindowReshuffleFn.java:56)
>>
>> org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowReshuffleFn.processElement(BatchGroupAlsoByWindowReshuffleFn.java:39)
>>
>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
>>
>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
>>
>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:114)
>>
>> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>>
>> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
>>
>> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
>>
>> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>>
>> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>>
>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
>>
>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
>>
>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
>>
>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>>
>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>>
>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>>
>> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>
>> java.base/java.lang.Thread.run(Thread.java:834)
>> Caused by: org.apache.beam.sdk.util.UserCodeException:
>> java.lang.RuntimeException: Stream mark expired.
>> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
>>
>> org.apache.beam.sdk.io.ReadAllViaFileBasedSource$ReadFileRangesFn$DoF

Re: FileIO Azure Storage problems

2020-12-02 Thread Pablo Estrada
Hi Thomas!
This is my guess for a fix for this issue:
https://github.com/apache/beam/pull/13441
I have not tested it, so I am not sure that it will work. I wonder if you'd
be willing to try it? (perhaps I can build a JAR for it... or... I'm not
sure, what do you think?)
Best
-P.

On Fri, Nov 20, 2020 at 11:20 AM Thomas Fredriksen(External) <
thomas.fredrik...@cognite.com> wrote:

> Hi Pablo!
>
> The file I am trying to ingest is 54.2 GB unzipped. It is available here:
> https://obis.org/manual/access/
>
> I created a Jira issue for you, but seem unable to assign it:
> https://issues.apache.org/jira/browse/BEAM-11313
>
> Thank you so much for taking a look at this!
>
> Best Regards
> Thomas Li Fredriksen
>
> On Fri, Nov 20, 2020 at 5:43 PM Pablo Estrada  wrote:
>
>> Follow up: What is the size of the file you're consuming?
>> -P.
>>
>> On Fri, Nov 20, 2020 at 8:40 AM Pablo Estrada  wrote:
>>
>>> Hi Thomas!
>>> This looks like it may be a bug with the azfs implementation. If you
>>> notice the code, you're hitting this issue when the byte channel needs to
>>> seek backwards.
>>> I may take a stab at fixing it. I believe we have to catch a mark
>>> expiration, and just reopen the file if that happens.
>>>
>>> Do you think you could create a JIRA issue to track it and assign it to
>>> me?
>>>
>>> On Thu, Nov 19, 2020 at 10:07 PM Thomas Li Fredriksen <
>>> thomas.fredrik...@cognite.com> wrote:
>>>
>>>> Good morning everyone.
>>>>
>>>>
>>>> I am attempting to parse a very large CSV (65 million lines) with BEAM
>>>> (version 2.25) from an Azure Blob and have created a pipeline for this. I
>>>> am running the pipeline on dataflow and testing with a smaller version of
>>>> the file (10'000 lines).
>>>>
>>>> I am using FileIO and the filesystem prefix "azfs" to read from azure
>>>> blobs.
>>>>
>>>> The pipeline works with the small test file, but when I run this on the
>>>> bigger file I am getting an exception "Stream Mark Expired" (pasted below).
>>>> Reading the same file from a GCP bucket works just fine, including when
>>>> running with dataflow.
>>>>
>>>> Our primary filestore is Azure Storage, so moving to GCP buckets is not
>>>> an option.
>>>>
>>>> Is there anyone who can help me resolve this?
>>>>
>>>>
>>>> Best Regards
>>>> Thomas Li Fredriksen
>>>>
>>>>
>>>> The exception referenced above:
>>>>
>>>>
>>>> Error message from worker: java.lang.RuntimeException:
>>>> org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException:
>>>> Stream mark expired.
>>>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowReshuffleFn.processElement(BatchGroupAlsoByWindowReshuffleFn.java:56)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowReshuffleFn.processElement(BatchGroupAlsoByWindowReshuffleFn.java:39)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:114)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>>>>
>>>> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>>>>
>>>> org.apache.beam.runners.dataflow.w

[VOTE] Release 2.27.0, release candidate #1

2020-12-22 Thread Pablo Estrada
Hi everyone,
Please review and vote on the release candidate #1 for the version 2.27.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if no issues are found.

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint
C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.27.0-RC1" [5],
* website pull request listing the release [6], publishing the API
reference manual [7], and the blog post [8].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.27.0 release to help with validation
[9].
* Docker images published to Docker Hub [10].

The vote will be open for at least 72 hours, but given the holidays, we
will likely extend for a few more days. The release will be adopted by
majority approval, with at least 3 PMC affirmative votes.

Thanks,
-P.

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380

[2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1145/
[5] https://github.com/apache/beam/tree/v2.27.0-RC1
[6] https://github.com/apache/beam/pull/13602
[7] https://github.com/apache/beam-site/pull/610
[8] https://github.com/apache/beam/pull/13603
[9]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106

[10] https://hub.docker.com/search?q=apache%2Fbeam&type=image


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-24 Thread Pablo Estrada
Alright! Thanks everyone for your validations. I'm cancelling this RC, and
I'll perform cherry picks to prepare the next one.

Please update this thread with any other cherry pick requests!
-P.

On Thu, Dec 24, 2020, 3:17 AM Ismaël Mejía  wrote:

> It might be a good idea to include also:
>
> [BEAM-11403] Cache UnboundedReader per UnboundedSourceRestriction in
> SDF Wrapper DoFn
> https://github.com/apache/beam/pull/13592
>
> So Java development experience is less affected (as with 2.26.0) (There
> is a flag to exclude but defaults matter).
>
> On Thu, Dec 24, 2020 at 2:56 AM Valentyn Tymofieiev 
> wrote:
> >
> > We discovered a regression on CombineFn.from_callable() started in
> 2.26.0. Even though it's not a regression in 2.27.0, I strongly prefer we
> fix it in 2.27.0 as it leads to buggy behavior, so I vote -1.
> >
> > The fix to release branch is in flight:
> https://github.com/apache/beam/pull/13613.
> >
> >
> >
> > On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette 
> wrote:
> >>
> >> -1 (non-binding)
> >> Good news: I validated a dataframe pipeline on Dataflow which looked
> good (with expected performance improvements!)
> >> Bad news: I also tried to run the sql_taxi example pipeline (streaming
> SQL in python) on Dataflow and ran into PubSub IO related issues. The
> example fails in the same way with 2.26.0, but it works in 2.25.0. It's
> possible this is a Dataflow bug and not a Beam one, but I'd like to
> investigate further to make sure.
> >>
> >> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver 
> wrote:
> >>>
> >>> +1 (non-binding) Validated wordcount with Python source + Flink and
> Spark job server jars. Also checked that the ...:sql:udf jar was added and
> includes our cherry-picks. Thanks Pablo :)
> >>>
> >>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:
> >>>>
> >>>> +1 (binding).
> >>>>
> >>>> I validated python quickstarts. Thank you Pablo.
> >>>>
> >>>> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre <
> j...@nanthrax.net> wrote:
> >>>>>
> >>>>> +1 (binding)
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> Le 23 déc. 2020 à 06:46, Pablo Estrada  a écrit
> :
> >>>>>
> >>>>> Hi everyone,
> >>>>> Please review and vote on the release candidate #1 for the version
> 2.27.0, as follows:
> >>>>> [ ] +1, Approve the release
> >>>>> [ ] -1, Do not approve the release (please provide specific comments)
> >>>>>
> >>>>>
> >>>>> Reviewers are encouraged to test their own use cases with the
> release candidate, and vote +1
> >>>>>  if no issues are found.
> >>>>>
> >>>>> The complete staging area is available for your review, which
> includes:
> >>>>> * JIRA release notes [1],
> >>>>> * the official Apache source release to be deployed to
> dist.apache.org [2], which is signed with the key with fingerprint
> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
> >>>>> * all artifacts to be deployed to the Maven Central Repository [4],
> >>>>> * source code tag "v2.27.0-RC1" [5],
> >>>>> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> >>>>> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> >>>>> * Validation sheet with a tab for 2.27.0 release to help with
> validation [9].
> >>>>> * Docker images published to Docker Hub [10].
> >>>>>
> >>>>> The vote will be open for at least 72 hours, but given the holidays,
> we will likely extend for a few more days. The release will be adopted by
> majority approval, with at least 3 PMC affirmative votes.
> >>>>>
> >>>>> Thanks,
> >>>>> -P.
> >>>>>
> >>>>> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
> >>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
> >>>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >>>>> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1145/
> >>>>> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
> >>>>> [6] https://github.com/apache/beam/pull/13602
> >>>>> [7] https://github.com/apache/beam-site/pull/610
> >>>>> [8] https://github.com/apache/beam/pull/13603
> >>>>> [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
> >>>>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
> >>>>>
> >>>>>
>


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-28 Thread Pablo Estrada
Good catch Ismael. Thanks!

I've created https://issues.apache.org/jira/browse/INFRA-21238 to request
the repositories to be created.

I am not sure what the status of this work is - should this block the
release until we can get those containers out?
cc: @Kyle Weaver 
Best
-P.

On Mon, Dec 28, 2020 at 9:48 AM Ismaël Mejía  wrote:

> It seems we are not publishing the latest versions of the Flink Job
> Server (Flink 1.11 and 1.12) docker images
>
> These do not exist:
>
> https://hub.docker.com/r/apache/beam_flink1.11_job_server
> https://hub.docker.com/r/apache/beam_flink1.12_job_server
>
> but this does and has the good 2.27.0_rc1 tag:
> https://hub.docker.com/r/apache/beam_flink1.10_job_server
>
> I wonder if the issue might be related to the fact that we need to
> request the repo to be created or if it is something different.
>
> On Thu, Dec 24, 2020 at 5:33 PM Brian Hulette  wrote:
> >
> > +Boyuan Zhang helped me get to the bottom of the sql_taxi issue. The
> problem is with the WriteStringsToPubSub API, which is deprecated since
> 2.7.0, but used in the example. Boyuan has [1] out to fix
> WriteStringsToPubSub and I just sent [2] to replace WriteStringsToPubSub
> with WriteToPubSub in example code. Issue is tracked in [3].
> >
> > [1] https://github.com/apache/beam/pull/13614
> > [2] https://github.com/apache/beam/pull/13615
> > [3] https://issues.apache.org/jira/browse/BEAM-11524
> >
> > On Thu, Dec 24, 2020 at 8:26 AM Pablo Estrada 
> wrote:
> >>
> >> Alright! Thanks everyone for your validations. I'm cancelling this RC,
> and I'll perform cherry picks to prepare the next one.
> >>
> >> Please update this thread with any other cherry pick requests!
> >> -P.
> >>
> >> On Thu, Dec 24, 2020, 3:17 AM Ismaël Mejía  wrote:
> >>>
> >>> It might be a good idea to include also:
> >>>
> >>> [BEAM-11403] Cache UnboundedReader per UnboundedSourceRestriction in
> >>> SDF Wrapper DoFn
> >>> https://github.com/apache/beam/pull/13592
> >>>
> >>> So Java development experience is less affected (as with 2.26.0) (There
> >>> is a flag to exclude but defaults matter).
> >>>
> >>> On Thu, Dec 24, 2020 at 2:56 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> >>> >
> >>> > We discovered a regression on CombineFn.from_callable() started in
> 2.26.0. Even though it's not a regression in 2.27.0, I strongly prefer we
> fix it in 2.27.0 as it leads to buggy behavior, so I vote -1.
> >>> >
> >>> > The fix to release branch is in flight:
> https://github.com/apache/beam/pull/13613.
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette 
> wrote:
> >>> >>
> >>> >> -1 (non-binding)
> >>> >> Good news: I validated a dataframe pipeline on Dataflow which
> looked good (with expected performance improvements!)
> >>> >> Bad news: I also tried to run the sql_taxi example pipeline
> (streaming SQL in python) on Dataflow and ran into PubSub IO related
> issues. The example fails in the same way with 2.26.0, but it works in
> 2.25.0. It's possible this is a Dataflow bug and not a Beam one, but I'd
> like to investigate further to make sure.
> >>> >>
> >>> >> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver 
> wrote:
> >>> >>>
> >>> >>> +1 (non-binding) Validated wordcount with Python source + Flink
> and Spark job server jars. Also checked that the ...:sql:udf jar was added
> and includes our cherry-picks. Thanks Pablo :)
> >>> >>>
> >>> >>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay 
> wrote:
> >>> >>>>
> >>> >>>> +1 (binding).
> >>> >>>>
> >>> >>>> I validated python quickstarts. Thank you Pablo.
> >>> >>>>
> >>> >>>> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre <
> j...@nanthrax.net> wrote:
> >>> >>>>>
> >>> >>>>> +1 (binding)
> >>> >>>>>
> >>> >>>>> Regards
> >>> >>>>> JB
> >>> >>>>>
> >>> >>>>> Le 23 déc. 2020 à 06:46, Pablo Estrada  a
> écrit :
> >>> >>>>>
> >>> >>>>> Hi everyone,
> >>> >>>>

[VOTE] Release 2.27.0, release candidate #4

2021-01-05 Thread Pablo Estrada
Hi everyone,
Please review and vote on the release candidate #4 for the version 2.27.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

*NOTE*. What happened to RC #2? I started building RC2 before completing
all the cherry-picks, so the tag for RC2 was created on an incorrect commit.

*NOTE*. What happened to RC #3? I started building RC3, but a new bug was
discovered (BEAM-11569) that required amending the branch. Thus this is now
RC4.

Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if no issues are found.

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint
C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.27.0-RC4" [5],
* website pull request listing the release [6], publishing the API
reference manual [7], and the blog post [8].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.27.0 release to help with validation
[9].
* Docker images published to Docker Hub [10].

The vote will be open for at least 72 hours, but given the holidays, we
will likely extend for a few more days. The release will be adopted by
majority approval, with at least 3 PMC affirmative votes.

Thanks,
-P.

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380

[2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1149/
[5] https://github.com/apache/beam/tree/v2.27.0-RC4
[6] https://github.com/apache/beam/pull/13602
[7] https://github.com/apache/beam-site/pull/610
[8] https://github.com/apache/beam/pull/13603
[9]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106

[10] https://hub.docker.com/search?q=apache%2Fbeam&type=image


Re: [VOTE] Release 2.27.0, release candidate #4

2021-01-06 Thread Pablo Estrada
+1 (binding)
I've built and unit tested existing Dataflow Templates with the new version.
Best
-P.

On Tue, Jan 5, 2021 at 11:17 PM Pablo Estrada  wrote:

> Hi everyone,
> Please review and vote on the release candidate #4 for the version 2.27.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> *NOTE*. What happened to RC #2? I started building RC2 before completing
> all the cherry-picks, so the tag for RC2 was created on an incorrect commit.
>
> *NOTE*. What happened to RC #3? I started building RC3, but a new bug was
> discovered (BEAM-11569) that required amending the branch. Thus this is now
> RC4.
>
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1
>  if no issues are found.
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint
> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.27.0-RC4" [5],
> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.27.0 release to help with validation
> [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours, but given the holidays, we
> will likely extend for a few more days. The release will be adopted by
> majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> -P.
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
>
> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1149/
>
> [5] https://github.com/apache/beam/tree/v2.27.0-RC4
> [6] https://github.com/apache/beam/pull/13602
> [7] https://github.com/apache/beam-site/pull/610
> [8] https://github.com/apache/beam/pull/13603
> [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
>
> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>


[ANNOUNCE] Beam 2.27.0 Released

2021-01-08 Thread Pablo Estrada
The Apache Beam team is pleased to announce the release of version 2.27.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing.
See https://beam.apache.org

You can download the release here:
https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam blog: https://beam.apache.org/blog/beam-2.27.0/

Thank you to everyone who contributed to this release, and we hope you
enjoy using Beam 2.27.0

-Pablo, on behalf of the Apache Beam community.


Re: Python SDK with S3IO on Flink

2021-02-25 Thread Pablo Estrada
Hi Nir!
Was this fixed by the PR you submitted?

On Wed, Feb 24, 2021 at 8:55 AM Nir Gazit  wrote:

> Hey,
> When trying to read a file from S3 with a combine action, the pipeline
> seems to be stuck. When replacing it with a GCP source it works fine.
> Furthermore - if I comment out the Count.PerElement part it also works.
>
> Anyone has an idea why that is?
>
> lines = p | beam.io.ReadFromText('s3://...')
> transformed = (
>     lines
>     | 'SplitLine' >> (beam.ParDo(WordExtractingDoFn()).with_output_types(unicode))
>     | 'Count' >> beam.combiners.Count.PerElement()
>     | 'Format' >> beam.MapTuple(lambda w, c: f'{w}: {c}')
> )
>
> transformed | 'Write' >> beam.io.WriteToText('s3://...')
>
> Thanks!
> Nir
>


Re: Dataflow v2 runner scaling behaviour

2021-03-24 Thread Pablo Estrada
Hi David,
Thanks for sharing. I've been investigating something like this recently. What's
the size of your data?
Best
-P.

On Wed, Mar 24, 2021, 7:52 AM David Sánchez  wrote:

> Hi folks!
>
> I'm testing the dataflow v2 runner in a batch pipeline (Apache Beam Python
> 3.7 SDK 2.27.0) that reads many million of rows from BigQuery and writes to
> PubSub and BigQuery using the flag "--experiments=use_runner_v2".
>
> The same job used to scale up immediately to over 50 workers, but in v2 it
> never scales up further than 5-6 workers, thus it's way slower. I can see
> however that the total vCPU and memory are about half than before, which is
> promising. Any clue about why the scaling is behaving differently?
>
> Many thanks
>


Re: Rate Limiting in Beam

2021-04-15 Thread Pablo Estrada
You could implement a Splittable DoFn that generates a limited number of
splits. We do something like this for
GenerateSequence.from(X).withRate(...) via UnboundedCountingSource[1]. It
keeps track of its local EPS, and generates new splits if a higher EPS is
wanted. This should help you scale up to the maximum EPS that you want,
and autoscaling will only produce the appropriate number of workers for
that number of splits.

- The only issue may be that you can't "scale down" if you find that some
of your splits have a very low throughput, because two splits can't be
merged back together (does that make sense?) - but Dataflow should be able
to scale down and schedule multiple splits in a single worker if that's the
case.

The UnboundedCountingSource is a Source, so it can't have an input (and
it's deprecated), but you could write a SplittableDoFn that has the same
behavior. Do you think this could work?


[1]
https://github.com/apache/beam/blob/8c9605f224115507912cf72e02d3fa94905548ec/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java#L334-L348
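For contrast, the very simplest form of the "assume x workers and divide the
global limit" approach from the original question would look roughly like the
sketch below. This is not the splittable-DoFn approach described above - it only
shows the local-limit half of the idea, using Guava's RateLimiter - and the
Request/Response types and the service call are placeholders.

class ThrottledCallFn extends DoFn<Request, Response> {
  private final double globalEps;    // desired global requests per second
  private final int assumedWorkers;  // the guess that makes this approach fragile
  private transient RateLimiter limiter;

  ThrottledCallFn(double globalEps, int assumedWorkers) {
    this.globalEps = globalEps;
    this.assumedWorkers = assumedWorkers;
  }

  @Setup
  public void setup() {
    // Each DoFn instance takes an equal share of the global budget. A worker may
    // run several instances in parallel, so this is only an approximation.
    limiter = RateLimiter.create(globalEps / assumedWorkers);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    limiter.acquire(); // blocks until a permit is available
    c.output(callExternalService(c.element()));
  }

  private Response callExternalService(Request request) {
    throw new UnsupportedOperationException("call your external service here");
  }
}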

On Thu, Apr 15, 2021 at 4:11 PM Evan Galpin  wrote:

> Could you possibly use a side input with fixed interval triggering[1] to
> query the Dataflow API to get the most recent log statement of scaling as
> suggested here[2]?
>
> [1]
> https://beam.apache.org/documentation/patterns/side-inputs/
> [2]
> https://stackoverflow.com/a/54406878/6432284
>
> On Thu, Apr 15, 2021 at 18:14 Daniel Thevessen  wrote:
>
>> Hi folks,
>>
>> I've been working on a custom PTransform that makes requests to another
>> service, and would like to add a rate limiting feature there. The
>> fundamental issue that I'm running into here is that I need a decent
>> heuristic to estimate the worker count, so that each worker can
>> independently set a limit which globally comes out to the right value. All
>> of this is easy if I know how many machines I have, but I'd like to use
>> Dataflow's autoscaling, which would easily break any pre-configured value.
>> I have seen two main approaches for rate limiting, both for a
>> configurable variable x:
>>
>>- Simply assume worker count is x, then divide by x to figure out the
>>"local" limit. The issue I have here is that if we assume x is 500, but it
>>is actually 50, I'm now paying for 50 nodes to throttle 10 times as much as
>>necessary. I know the pipeline options have a reference to the runner; is
>>it possible to get an approximate current worker count from that at bundle
>>start (*if* runner is DataflowRunner)?
>>- Add another PTransform in front of the API requests, which groups
>>by x number of keys, throttles, and keeps forwarding elements with an
>>instant trigger. I initially really liked this solution because even if x
>>is misconfigured, I will have at most x workers running and throttle
>>appropriately. However, I noticed that for batch pipelines, this
>>effectively also caps the API request stage at x workers. If I throw in a
>>`Reshuffle`, there is another GroupByKey (-> another stage), and nothing
>>gets done until every element has passed through the throttler.
>>
>> Has anyone here tried to figure out rate limiting with Beam before, and
>> perhaps run into similar issues? I would love to know if there is a
>> preferred solution to this type of problem.
>> I know sharing state like that runs a little counter to the Beam pipeline
>> paradigm, but really all I need is an approximate worker count with few
>> guarantees.
>>
>> Cheers,
>> Daniel
>>
>


Re: Testing of Metrics in context of DoFnTester

2017-05-09 Thread Pablo Estrada
Hi Michael,
For the Metrics API, the way to programmatically query the value of a metric
is by using the MetricResults.queryMetrics method. You get the
MetricResults object from the PipelineResult object, and query it like so:

PipelineResult res = p.run();
MetricQueryResults metricResult = res.metrics().queryMetrics();

The queryMetrics method takes in a MetricsFilter instance.
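For example, to read back a counter named "someCounter" after the pipeline
finishes, something along these lines should work (the namespace and metric
name are illustrative; these classes live in org.apache.beam.sdk.metrics):

PipelineResult result = p.run();
result.waitUntilFinish();

MetricQueryResults metrics =
    result
        .metrics()
        .queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.named("myNamespace", "someCounter"))
                .build());

for (MetricResult<Long> counter : metrics.getCounters()) {
  // getAttempted() works on every runner that reports metrics; getCommitted()
  // is only available where the runner supports committed metrics.
  System.out.println("someCounter = " + counter.getAttempted());
}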

Not all runners support this operation. For the Dataflow runner, PR 2896[1]
should add it.

Let me know if you need more help with this.
Best
-P.

[1] - https://github.com/apache/beam/pull/2896

On Tue, May 9, 2017 at 3:48 PM Michael Luckey  wrote:

> Hi,
>
> currently we are evaluating a migration from 0.6.0 to current. We
> encountered the following problem, which we are currently not sure how best
> to solve.
>
> Say we have a DoFn, which is using Aggregators, e.g.
>
> ctr = createAggregator("someCounter", Sum.ofLongs());
>
>  We were testing them with DoFn-Tester like
>
> final DoFnTester doFnTester = DoFnTester.of(fixture);
> doFnTester.processElement(input);
>
> assertThat(doFnTester.getAggregatorValue(fixture.ctr).longValue(), is(1L));
>
> As aggregators are removed now from the codebase, we are considering using
> Metrics instead. But we did not find an equivalent to the
> getAggregatorValue method on DoFnTester.
>
> Any suggestion how we could keep our counters tested within a unit test
> based on DoFnTester? Or do we have to find a completely different solution?
> Are we doing something completely wrong trying to test correct workings of
> our counters with this approach?
>
> Regards,
>
> michel
>
>
>


Re: Testing of Metrics in context of DoFnTester

2017-05-09 Thread Pablo Estrada
Hi Michael,
I'm sorry. I see I did not read your first email properly.

There are a couple of places in the core SDK or runner code and tests that
used to use aggregators and now use metrics. There are two reasonable
options for this:
1. In [1], the test sets up the global metrics environment by setting the
current container (e.g. MetricsEnvironment.setCurrentContainer(new
MetricsContainer("anystep"));), and the LateDataFilter uses metrics
normally [2], by creating a counter that relies on the environment set up in
the test (there is a sketch of this pattern below the second option).

2. If you'd rather not rely on setting up a global environment, you can use
CounterCell, and pass it in to your test. In [3] it's not a test, but a
CounterCell is still created to keep internal statistics, and later its
value is checked [4]. Note that, as you can see in [3], CounterCells are
a bit quirky to create, since we did not intend for external users to
create them directly.
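For a DoFnTester-based test, option (1) would look roughly like the sketch
below. This is only a sketch: the container class was MetricsContainer in
older releases and is MetricsContainerImpl (in runners-core) in newer ones,
and MyDoFn, its input/output types and "someCounter" are stand-ins for your
own fixture.

// Make a container current for this thread so that Metrics.counter(...)
// calls inside the DoFn have somewhere to record to.
MetricsContainerImpl container = new MetricsContainerImpl("anyStep");
MetricsEnvironment.setCurrentContainer(container);

DoFnTester<InputT, OutputT> doFnTester = DoFnTester.of(new MyDoFn());
doFnTester.processElement(input);

// Read the counter back out of the container.
long count =
    container
        .getCounter(MetricName.named(MyDoFn.class, "someCounter"))
        .getCumulative();
assertThat(count, is(1L));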

Let me know if these suggestions are helpful.
Best
-P.

[1] -
https://github.com/apache/beam/blob/master/runners/core-java/src/test/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunnerTest.java#L61

[2] -
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L96
[3] -
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L210
[4] -
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L326

On Tue, May 9, 2017 at 4:33 PM Michael Luckey  wrote:

> Hi Pablo,
>
> thanks for your help! We certainly could change our testing code and
> involve execution of a pipeline during tests.
>
> But currently we are leveraging DoFnTester, i.e. scoping our tests to the
> DoFn only, which means there is neither a pipeline nor a pipeline result
> involved that I could call upon.
>
> It might be a bad idea to try to test counters on this basis, but as it
> was supported previously I thought we might have overlooked an API for
> accessing these metrics somehow within DoFnTester. I am not sure whether it
> makes sense for DoFnTester to somehow provide metrics support to enable
> this kind of test. I am certainly not too keen on the idea of starting to do
> some mocking of the metrics API within my test implementation.
>
> Regards,
>
> michel
>
>
> On Wed, May 10, 2017 at 1:10 AM, Pablo Estrada  wrote:
>
>> Hi Michael,
>> For the Metrics API, the way to programatically query the value of a
>> metric is by using the MetricResults.queryMetrics method. You get the
>> MetricResults object from the PipelineResult object, and query it like so:
>>
>> PipelineResult res = p.run();
>> MetricQueryResults metricResult = res.metrics().queryMetrics();
>>
>> The queryMetrics method takes in a MetricsFilter instance.
>>
>> Not all runners support this operation. For the dataflow runner, PR
>> 2896[1] should add it.
>>
>> Let me know if you need more help with this.
>> Best
>> -P.
>>
>> [1] - https://github.com/apache/beam/pull/2896
>>
>> On Tue, May 9, 2017 at 3:48 PM Michael Luckey 
>> wrote:
>>
>>> Hi,
>>>
>>> currently we are evaluating a migration from 0.6.0 to current. We
>>> encountered the following problem, which we currently not sure, how to best
>>> solve.
>>>
>>> Say we have a DoFn, which is using Aggregators, e.g.
>>>
>>> ctr = createAggregator("someCounter", Sum.ofLongs());
>>>
>>>  We were testing them with DoFn-Tester like
>>>
>>> final DoFnTester doFnTester =
>>> DoFnTester.of(fixture);
>>> doFnTester.processElement(input);
>>>
>>>
>>>  assertThat(doFnTester.getAggregatorValue(fixture.ctr).longValue(), is(1L));
>>>
>>> As aggregators are removed now from the codebase, we are considering
>>> using Metrics instead. But we did not find an equivalent to the
>>> getAggregatorValue method on DoFnTester.
>>>
>>> Any suggestion how we could keep our counters tested within a unit test
>>> based on DoFnTester? Or do we have to find a completely different solution?
>>> Are we doing something completely wrong trying to test correct workings of
>>> our counters with this approach?
>>>
>>> Regards,
>>>
>>> michel
>>>
>>>
>>>
>


Re: Data guarantees PubSub to GCS

2018-01-04 Thread Pablo Estrada
I am not a streaming expert, but I will answer according to how I
understand the system, and others can correct me if I get something wrong.

*Regarding elements coming from PubSub into your pipeline:*
Once the data enters your pipeline, it is 'acknowledged' on your PubSub
subscription, and you won't be able to retrieve it again from PubSub on the
same subscription.

*Regarding elements stuck within your pipeline:*
Bundles in a streaming pipeline are executed and committed individually.
This means that one bundle may be stuck, while all other bundles may be
moving forward in your pipeline. In a case like this, you won't be able to
drain the pipeline because there is one bundle that can't be drained out
(because exceptions are thrown every time processing for it is attempted).
On the other hand, if you cancel your pipeline, then the information
regarding the progress made by each bundle will be lost, so you will drop
the data that was stuck within your pipeline, and was never written out.
(That data was also acked in your PubSub subscription, so it won't come out
from PubSub if you reattach to the same subscription later). - So cancel
may not be what you're looking for either.

For cases like these, what you'd need to do is to live-update your pipeline
with code that can handle the problems in your current pipeline. This new
code will replace the code in your pipeline stages, and then Dataflow will
continue processing your data from the state it was in before the
update. This means that if there's one bundle that was stuck, it will be
retried against the new code, and it will finally make progress across your
pipeline.

If you want to completely change, or stop your pipeline without dropping
stuck bundles, you will still need to live-update it, and then drain it.

I hope that was clear. Let me know if you need more clarification - and
perhaps others will have more to add / correct.
Best!
-P.

On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones 
wrote:

> Hi,
>
> I'd like to confirm Beam's data guarantees when used with Google Cloud
> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
> documentation on it.
>
> If the Beam job is running successfully, then I believe all data will be
> delivered to GCS at least once. If I stop the job with 'Drain', then any
> inflight data will be processed and saved.
>
> What happens if the Beam job is not running successfully, and maybe
> throwing exceptions? Will the data still be available in PubSub when I
> cancel (not drain) the job? Does a drain work successfully if the data
> cannot be written to GCS because of the exceptions?
>
> Thanks,
> Andrew
>


Re: Data guarantees PubSub to GCS

2018-01-04 Thread Pablo Estrada
Oops - well, sorry about that! Glad Luke was able to clarify.
Best.
-P.

On Thu, Jan 4, 2018, 1:20 PM Lukasz Cwik  wrote:

> That is correct Derek, Google Cloud Dataflow will only ack the message to
> Pubsub when a bundle completes.
> For very simple pipelines fusion will make it so that all downstream
> actions will happen within the same bundle.
>
> On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu 
> wrote:
>
>> Hi Pablo,
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>> subscription, and you won't be able to retrieve it again from PubSub on the
>> same subscription.
>>
>> This part differs from my understanding of consuming Pub/Sub messages in
>> the Dataflow pipeline. I think the message will only be committed when a
>> PCollection in the pipeline gets materialized (
>> https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio),
>> which means that, if the pipeline is not complicated, fusion optimization would
>> fuse multiple stages together, and if any of these stages throws an
>> exception, the Pub/Sub message won't be acknowledged. I've also verified
>> this behavior.
>>
>> Let me know if my understanding is correct. :)
>>
>> Thanks,
>>
>> Derek
>>
>>
>> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada 
>> wrote:
>>
>>> I am not a streaming expert, but I will answer according to how I
>>> understand the system, and others can correct me if I get something wrong.
>>>
>>> *Regarding elements coming from PubSub into your pipeline:*
>>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>>> subscription, and you won't be able to retrieve it again from PubSub on the
>>> same subscription.
>>>
>>> *Regarding elements stuck within your pipeline:*
>>> Bundles in a streaming pipeline are executed and committed individually.
>>> This means that one bundle may be stuck, while all other bundles may be
>>> moving forward in your pipeline. In a case like this, you won't be able to
>>> drain the pipeline because there is one bundle that can't be drained out
>>> (because exceptions are thrown every time processing for it is attempted).
>>> On the other hand, if you cancel your pipeline, then the information
>>> regarding the progress made by each bundle will be lost, so you will drop
>>> the data that was stuck within your pipeline, and was never written out.
>>> (That data was also acked in your PubSub subscription, so it won't come out
>>> from PubSub if you reattach to the same subscription later). - So cancel
>>> may not be what you're looking for either.
>>>
>>> For cases like these, what you'd need to do is to live-update your
>>> pipeline with code that can handle the problems in your current pipeline.
>>> This new code will replace the code in your pipeline stages, and then
>>> Dataflow will continue processing of your data in the state that it was
>>> before the update. This means that if there's one bundle that was stuck, it
>>> will be retried against the new code, and it will finally make progress
>>> across your pipeline.
>>>
>>> If you want to completely change, or stop your pipeline without dropping
>>> stuck bundles, you will still need to live-update it, and then drain it.
>>>
>>> I hope that was clear. Let me know if you need more clarification - and
>>> perhaps others will have more to add / correct.
>>> Best!
>>> -P.
>>>
>>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <
>>> andrew+b...@andrew-jones.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to confirm Beams data guarantees when used with Google Cloud
>>>> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
>>>> documentation on it.
>>>>
>>>> If the Beam job is running successfully, then I believe all data will
>>>> be delivered to GCS at least once. If I stop the job with 'Drain', then any
>>>> inflight data will be processed and saved.
>>>>
>>>> What happens if the Beam job is not running successfully, and maybe
>>>> throwing exceptions? Will the data still be available in PubSub when I
>>>> cancel (not drain) the job? Does a drain work successfully if the data
>>>> cannot be written to GCS because of the exceptions?
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>
>>
>>
>> --
>> Derek Hao Hu
>>
>> Software Engineer | Snapchat
>> Snap Inc.
>>
>
>


Re: Advice on parallelizing network calls in DoFn

2018-03-13 Thread Pablo Estrada
I'd just like to close the loop. Josh, did you get an answer/guidance on
how to proceed with your pipeline?
Or maybe we'll need a new thread to figure that out : )
Best
-P.

On Fri, Mar 9, 2018 at 1:39 PM Josh Ferge 
wrote:

> Hello all:
>
> Our team has a pipeline that makes external network calls. These pipelines
> are currently super slow, and the hypothesis is that they are slow because
> we are not threading for our network calls. The github issue below provides
> some discussion around this:
>
> https://github.com/apache/beam/pull/957
>
> In beam 1.0, there was IntraBundleParallelization, which helped with this.
> However, this was removed because it didn't comply with a few BEAM
> paradigms.
>
> Questions going forward:
>
> What is advised for jobs that make blocking network calls? It seems
> bundling the elements into groups of size X prior to passing to the DoFn,
> and managing the threading within the function might work. Thoughts?
> Are these types of jobs even suitable for beam?
> Are there any plans to develop features that help with this?
>
> Thanks
>


Re: [Please take it] Apache Beam Summit - Exit Survey

2018-03-28 Thread Pablo Estrada
Including user list.

-- Forwarded message -
From: Pablo Estrada 
Date: Wed, Mar 28, 2018 at 1:15 PM
Subject: [Please take it] Apache Beam Summit - Exit Survey
To: d...@beam.apache.org 


Hello all,
For those who attended the summit (or for those who did not attend but
would like to attend subsequent ones, or share comments about it), please
take a few minutes of your time to answer this short survey:

https://goo.gl/forms/SqRXgFJuN3FPFp002

The purpose of the survey is to try to understand what was useful about the
summit, and what should be improved in the future. I'll share results of
the survey with this list in a few weeks, after a number of people have
taken it.

Thanks all!
-P.


Re: BigQuery streaming insert errors

2018-04-05 Thread Pablo Estrada
I'm adding Cham, as he might be knowledgeable about BQ IO, or he might be
able to redirect us to someone else.
Cham, do you have guidance for Carlos here?
Thanks
-P.

On Mon, Apr 2, 2018 at 11:08 AM Carlos Alonso  wrote:

> And... where could I catch that exception?
>
> Thanks!
> On Mon, 2 Apr 2018 at 16:58, Ted Yu  wrote:
>
>> Wouldn't the following code give you information about failed insertions
>> (around line 790 in BigQueryServicesImpl) ?
>>
>>   if (!allErrors.isEmpty()) {
>> throw new IOException("Insert failed: " + allErrors);
>>
>> Cheers
>>
>> On Mon, Apr 2, 2018 at 7:16 AM, Carlos Alonso 
>> wrote:
>>
>>> Hi everyone!!
>>>
>>> I was wondering if there's any way to get the error explaining why a
>>> (streaming) insert failed. Looking at the code, I think there's currently no way to
>>> do that, as the BigQueryServicesImpl insertAll seems to discard the errors
>>> and just add the failed TableRow instances into the failedInserts list.
>>>
>>> It would be very nice to have an "enriched" TableRow returned instead
>>> that contains the error information for further processing (in our use case
>>> we're saving the failed ones into a different table for further analysis)
>>>
>>> Could this be added as an enhancement or similar Issue in GH/Jira? Any
>>> other ideas?
>>>
>>> Thanks!
>>>
>>
>> --
Got feedback? go/pabloem-feedback
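
For the record, later SDK releases added first-class support for this kind of
dead-letter handling. In the Python SDK, for example, rows that fail streaming
insertion come back as an extra output of WriteToBigQuery and can be routed to a
separate table. A rough sketch (table names and schemas are made up, and the exact
shape of the failed-rows output varies across Beam versions, so treat this as
illustrative rather than authoritative):

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryWriteFn
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

with beam.Pipeline() as p:
    rows = p | beam.Create([{'name': 'a', 'value': 1}])

    result = rows | beam.io.WriteToBigQuery(
        'my-project:my_dataset.main_table',          # made-up destination
        schema='name:STRING,value:INTEGER',
        method='STREAMING_INSERTS',
        insert_retry_strategy=RetryStrategy.RETRY_NEVER)

    # Failed rows are emitted as (destination, row) pairs on a tagged output.
    _ = (result[BigQueryWriteFn.FAILED_ROWS]
         | beam.Map(lambda dest_row: dest_row[1])
         | beam.io.WriteToBigQuery(
             'my-project:my_dataset.dead_letter',    # made-up destination
             schema='name:STRING,value:INTEGER'))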


Re: Testing an updating side input on global window

2018-05-29 Thread Pablo Estrada
As far as I know, that behavior is not specified. I do not think that
Dataflow streaming supports this sort of updating to side inputs, though
I've added Slava who might have more to add.

If updating side inputs is really not supported in Dataflow, you may be
able to use a LoadingCache, like so:
https://lists.apache.org/thread.html/%3cb1660eab-aec8-4635-8386-8353685db...@gameduell.de%3E

Best
-P.

On Tue, May 29, 2018 at 2:36 PM Carlos Alonso  wrote:

> Hi Lukasz, many thanks for your responses.
>
> I'm actually using them but I think I'm not being able to synchronise the
> following steps:
> 1: The side input gets its first value (v1)
> 2: The main stream gets that side input applied and finds that v1 value
> 3: The side one gets updated (v2)
> 4: The main stream gets the side input applied again and finds the v2
> value (along with v1 as this is multimap)
>
> Regards
>
> On Tue, May 29, 2018 at 10:57 PM Lukasz Cwik  wrote:
>
>> Your best bet is to use TestStreams[1] as it is used to validate
>> window/triggering behavior. Note that the transform requires special runner
>> based execution and currently only works with the DirectRunner. All
>> examples are marked with the JUnit category "UsesTestStream", for example
>> [2].
>>
>> 1:
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestStream.java
>> 2:
>> https://github.com/apache/beam/blob/0cbcf4ad1db7d820c5476d636f3a3d69062021a5/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java#L69
>>
>>
>> On Tue, May 29, 2018 at 1:05 PM Carlos Alonso 
>> wrote:
>>
>>> Hi all!!
>>>
>>> Basically that's what I'm trying to do. I'm building a pipeline that has
>>> a refreshing, multimap, side input (BQ schemas) that then I apply to the
>>> main stream of data (records that are ultimately saved to the corresponding
>>> BQ table).
>>>
>>> My job, although being of streaming nature, runs on the global window,
>>> and I want to unit test that the side input refreshes and that the updates
>>> are successfully applied.
>>>
>>> I'm using scio and I can't seem to simulate that refreshing behaviour.
>>> These are the relevant bits of the code:
>>> https://gist.github.com/calonso/87d392a65079a66db75a78cb8d80ea98
>>>
>>> The way I see understand it, the side collection is refreshed before
>>> accessing it so when accessed, it already contains the final (updated)
>>> snapshot of the schemas, is that true? In which case, how can I simulate
>>> that synchronisation? I'm using processing times as I thought that could be
>>> the way to go, but obviously something is wrong there.
>>>
>>> Many thanks!!
>>>
>> --
Got feedback? go/pabloem-feedback
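
For completeness, this is roughly what driving a pipeline through time with
TestStream looks like in the Python SDK (only the DirectRunner supports it, and
whether a runner actually re-reads a slowly changing side input between firings is
still runner-dependent, as discussed above). The element values and timings here are
arbitrary:

import apache_beam as beam
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms import trigger, window

stream = (TestStream()
          .advance_watermark_to(0)
          .add_elements(['v1'])          # first version of the side data
          .advance_processing_time(60)   # one minute of processing time passes
          .add_elements(['v2'])          # updated version arrives
          .advance_watermark_to_infinity())

with beam.Pipeline() as p:
    _ = (p
         | stream
         | beam.WindowInto(
             window.GlobalWindows(),
             trigger=trigger.Repeatedly(trigger.AfterCount(1)),
             accumulation_mode=trigger.AccumulationMode.DISCARDING)
         | beam.Map(print))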


Re: Initial contributor experience

2018-06-05 Thread Pablo Estrada
Thanks Austin for taking the time to go through this! We came out with a
few JIRAs to improve the documentation (see doc), and hopefully we'll keep
iterating on this.

Hopefully we can get more experiences from other people that start to
approach Beam.

Best
-P.

On Tue, Jun 5, 2018 at 1:49 PM Griselda Cuevas  wrote:

> +user@ in case someone has had similar experiences.
>
> Thanks for documenting this Austin & Pablo!
>
> If any other folks would like to participate in improving the "First
> contribution experience" for Beam let us know in this thread.
>
> On Tue, 5 Jun 2018 at 13:40, Austin Bennett 
> wrote:
>
>> Had been meaning to get setup to contribute for a bit - so walked through
>> with Pablo, to try to point out assumptions, lacking docs, etc.
>>
>> Write-up found-->
>>
>>
>> https://docs.google.com/document/d/1hq-s3L676LkMTftvhv0eCkdwrRnZmCRiLdaQBWLHWWA/edit
>>
> --
Got feedback? go/pabloem-feedback


Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Pablo Estrada
I agree that we should delete the pom.xml files soon, as they create a
burden for maintainers.

I'd like to be able to extend the grace period by a bit, to allow the
internal build systems at Google to move away from using the Beam poms.

We use these pom files to build Dataflow workers, and thus it's critical
for us that they are available for a few more weeks while we set up a
gradle build. Perhaps 4 weeks?
(Calling out +Chamikara Jayalath  who has recently
worked on internal Dataflow tooling.)

Best
-P.

On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:

> Note: Apache Beam will still provide pom.xml for each release it produces.
> This is only about people using Maven to build Apache Beam themselves and
> not relying on the released artifacts in Maven Central.
>
> With the first release using Gradle as the build system is underway, I
> wanted to start this thread to remind people that we are going to delete
> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
> week grace period.
>
> Are there others who would like a shorter/longer grace period?
>
> The PR to delete the pom.xml is here:
> https://github.com/apache/beam/pull/5571
>
-- 
Got feedback? go/pabloem-feedback


Re: [FYI] New Apache Beam Swag Store!

2018-06-08 Thread Pablo Estrada
Nice : D

On Fri, Jun 8, 2018, 3:43 PM Raghu Angadi  wrote:

> Woo-hoo! This is terrific.
>
> If we are increasing color choices I would like black or charcoal... Beam
> logo would really pop on a dark background.
>
> On Fri, Jun 8, 2018 at 3:32 PM Griselda Cuevas  wrote:
>
>> Hi Beam Community,
>>
>> I just want to share with you the exciting news about our brand new
>> Apache Beam Swag Store!
>>
>> You can find it here: https://store-beam.myshopify.com/
>>
>> *How does it work?*
>>
>>- You can just select the items you want and check-out. Our Vendor
>>ships to anywhere in the world and normally can have swag to be delivered
>>within 1 week. Each company or user will need to pay for their own swag.
>>- If you are hosting an event or representing Beam at one, reach out
>>to me or the beam-events-meetups slack channel, I'll be happy to review
>>your event and see if we can sponsor the swag. We'll have codes for this
>>occasions thanks to Google, who has sponsored an initial inventory.
>>
>> If you have feedback, ideas on new swag, questions or suggestions, reach
>> out to me and/or Matthias Baetens.
>>
>> Happy Friday!
>> G
>>
>>
>> --
Got feedback? go/pabloem-feedback


Re: Apache Beam June Newsletter

2018-06-13 Thread Pablo Estrada
Thanks Gris! Lots of interesting things.
Best
-P.

On Wed, Jun 13, 2018 at 4:40 PM Griselda Cuevas  wrote:

> Hi Beam Community!
>
> Here
> 
>  [1]
> is the June Edition of our Apache Beam Newsletter. This edition was curated
> by our community of contributors, committers and PMCs. Generally, it
> contains the work done in the previous month (May in this case) and what's
> planned for the future.
>
> We hope to provide visibility to what's going on in the community, so if
> you have questions, feel free to ask in this thread.
>
> Cheers,
> Gris
>
> [1]
> https://docs.google.com/document/d/1BwRhOu-uDd3SLB_Om_Beke5RoGKos4hj7Ljh7zM2YIo/edit?ts=5b17fb92#
>
> --
-- 
Got feedback? go/pabloem-feedback


Any Beamers at OSCON?

2018-07-15 Thread Pablo Estrada
Hello everyone,
I am reaching out for people that will be attending this year's OSCon in
Portland[1], starting tomorrow 7/16. I am aware of a few Beam people that
will be attending along with me, and I'm sure we'll be glad to see others.

I am also aware of one talk involving Beam, by Holden Karau [2].

Let me know if anyone else is also going to OSCon, so that we can meet up,
and chat. I'll be spending time at the Google booth, and maybe at some of
the talks!

Best
-P.

[1] https://conferences.oreilly.com/oscon/oscon-or
[2]
https://conferences.oreilly.com/oscon/oscon-or/public/schedule/detail/66320
-- 
Got feedback? go/pabloem-feedback


Re: [Call for items] Beam August Newsletter

2018-08-02 Thread Pablo Estrada
Thanks for doing this Rose! I'll add a couple of things.
-P.

On Thu, Aug 2, 2018, 4:18 PM Rose Nguyen  wrote:

> Hi all:
>
> Here's
> 
>  [1]
> the template for the August Beam Newsletter!
>
> *Add the highlights from June and July that you want to share with
> the community by 8/8 11:59 p.m. **PDT.*
>
> I'm working with Gris--we've heard your requests and will collect the
> notes via Google docs but send out the final version directly to the user
> mailing list. I'll edit and send the newsletter on 8/10.
>
> Thanks!
>
> [1]
> https://docs.google.com/document/d/124klHcJcIi_gD6rvMXwbbToINl1KTdXPmYrgGB998FQ/edit
> --
>
>
> Rose Thi Nguyen
>
>   Technical Writer
>
> (281) 683-6900
>
> --
-- 
Got feedback? go/pabloem-feedback


[ANNOUNCE] Apache Beam 2.6.0 released!

2018-08-08 Thread Pablo Estrada
The Apache Beam team is pleased to announce the release of 2.6.0 version!

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes the following major new features & improvements,
among others:
- Improvements for internal Context Management in Python SDK
- A number of improvements to the Portability Framework
- A Universal Local Runner has been added to Beam. This runner runs in a
single machine using portability, and containerized SDK harnesses.
- Increased the coverage of ErrorProne analysis of the codebase.
- Updates to various dependency versions
- Updates to stability, performance, and documentation.
- SQL - improvements: support exists operator, implemented sum()
aggregations, fixes to CASE expression, support for date comparison,
support LIMIT on Unbounded Data
- Provide automatic schema registration for POJOs

You can take a look at the Release Notes for more details:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343392&projectId=12319527

Thanks to everyone who participated in this release, and we hope you'll
have a good time using Beam 2.6.0.
--
Pablo Estrada, on behalf of The Apache Beam team
-- 
Got feedback? go/pabloem-feedback


Re: Beam Java newbie.

2018-08-11 Thread Pablo Estrada
Can you elaborate on your setup? Are you using a Maven archetype? Did you
hand-write a pom.xml? Are you installing from code with gradle?
Best
-P.

On Sat, Aug 11, 2018, 3:43 PM Mahesh Vangala 
wrote:

> Hello folks -
>
> I am enthusiastic about learning beam using java sdk.
> I set up maven using eclipse today.
> However, I ran into the following error message:
>
> Caused by: java.lang.ClassNotFoundException:
> org.apache.beam.vendor.protobuf.v3.com.google.protobuf.ProtocolMessageEnum
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 12 more
>
>
> Any help from the team is appreciated!
> Thanks.
>
> - Mahesh
>
-- 
Got feedback? go/pabloem-feedback


Re: Slides from OSCON

2018-09-04 Thread Pablo Estrada
I found this video:
https://www.safaribooksonline.com/videos/oscon-2018/9781492026075/9781492026075-video321550
Though it requires a subscription to Safari books.
Best
-P.

On Tue, Sep 4, 2018 at 8:21 AM Matthias Baetens 
wrote:

> Hey Holden, just checking if you were able to find something? :)
> -M
>
>
> On Sat, Aug 25, 2018, 12:07 Holden Karau  wrote:
>
>> Good question, I’ll take a look tonight :)
>>
>> On Sat, Aug 25, 2018 at 6:32 AM Matthias Baetens <
>> baetensmatth...@gmail.com> wrote:
>>
>>> Hey Holden,
>>> Have you found the video in the meanwhile? Happy to upload it to the
>>> Beam channel :)
>>> Cheers,
>>> M
>>>
>>> On Wed, 25 Jul 2018 at 02:52 Holden Karau  wrote:
>>>
 Hi Friends,

 If you missed my talk @ OSCON on TFT/TFMA + Beam on Flink (and other
 related adventures), I've uploaded my slides to
 https://www.slideshare.net/hkarau/powering-tensorflow-with-big-data-using-apache-beam-flink-and-spark-oscon-pdx-2018
 and once the video is done I'll try and find a way to share that as well.

 Cheers,

 Holden :)


 --
 Twitter: https://twitter.com/holdenkarau

>>> --
>>>
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
>
>


Re: Modular IO presentation at Apachecon

2018-09-27 Thread Pablo Estrada
I'll take this chance to plug in my little directory of Beam
tools/materials: https://github.com/pabloem/awesome-beam

Please feel free to send PRs : )

On Wed, Sep 26, 2018 at 10:29 PM Ankur Goenka  wrote:

> Thanks for sharing. Great slides and looking for the recorded session.
>
> Do we have a central location where we link all the beam presentations for
> discoverability?
>
> On Wed, Sep 26, 2018 at 9:35 PM Thomas Weise  wrote:
>
>> Thanks for sharing. I'm looking forward to see the recording of the talk
>> (hopefully!).
>>
>> This will be very helpful for Beam users. IO still is typically the
>> unexpectedly hard and time consuming part of authoring pipelines.
>>
>>
>> On Wed, Sep 26, 2018 at 2:48 PM Alan Myrvold  wrote:
>>
>>> Thanks for the slides.
>>> Really enjoyed the talk in person, especially the concept that IO is a
>>> transformation, and a source or sink are not special and the splittable
>>> DoFn explanation.
>>>
>>> On Wed, Sep 26, 2018 at 2:17 PM Ismaël Mejía  wrote:
>>>
 Hello, today Eugene and me did a talk about about modular APIs for IO
 at ApacheCon. This talk introduces some common patterns that we have
 found while creating IO connectors and also presents recent ideas like
 dynamic destinations, sequential writes among others using FileIO as a
 use case.

 In case you guys want to take a look, here is a copy of the slides, we
 will probably add this to the IO authoring documentation too.

 https://s.apache.org/beam-modular-io-talk

>>>


Re: Agenda for the Beam Summit London 2018

2018-09-27 Thread Pablo Estrada
Very exciting. I will have to miss it, but I'm excited to see what comes
out of it:)
Thanks to Gris, Matthias and other organizers.
Best
-P.

On Thu, Sep 27, 2018, 4:26 PM Jean-Baptiste Onofré  wrote:

> Great !! Thanks Gris.
>
> Looking forward to see you all next Monday in London.
>
> Regards
>
> JB
> On Sep 27, 2018, at 18:03, Griselda Cuevas  wrote:
>>
>> Hi Beam Community,
>>
>> We have finalized the agenda for the Beam Summit London 2018, it's here:
>> https://www.linkedin.com/feed/update/urn:li:activity:6450125487321735168/
>>
>>
>> We had a great amount of talk proposals, thank you so much to everyone
>> who submitted one! We also sold out the event, so we're very excited to see
>> the community growing.
>>
>>
>> See you around,
>>
>> Gris on behalf of the Organizing Committee
>>
>


Re: 2019 Beam Events

2018-12-04 Thread Pablo Estrada
FWIW, for some of these places that have interest (e.g. Brazil, Israel),
it's possible to create a group in meetup.com, and start gauging interest,
and looking for organizers.
Once a group of people with interest exists, it's easier to get interest /
sponsorship to bring speakers.
So if you are willing to create the group in meetup, Daniel, we can monitor
it and try to plan something as it grows : )
Best
-P.

On Tue, Dec 4, 2018 at 10:55 AM Daniel Salerno 
wrote:

>
> It's a shame that there are no events in Brazil ...
>
> =(
>
> On Tue, Dec 4, 2018 at 1:12 PM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> agree 👍
>>
>> On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:
>>
>>> Israel would be nice to have one
>>> chaim
>>> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas  wrote:
>>> >
>>> > Hi Beam Community,
>>> >
>>> > I started curating industry conferences, meetups and events that are
>>> relevant for Beam, this initial list I came up with. I'd love your help
>>> adding others that I might have overlooked. Once we're satisfied with the
>>> list, let's re-share so we can coordinate proposal submissions, attendance
>>> and community meetups there.
>>> >
>>> >
>>> > Cheers,
>>> >
>>> > G
>>> >
>>> >
>>> >
>>>
>>> --
>>>
>>>
>>>
>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>


Re: Single threaded processing

2019-01-07 Thread Pablo Estrada
Hi Matt,
is this computation running as part of a larger pipeline that does run some
parallel processing? Otherwise, it's odd that it needs to run on Beam.
Nonetheless, you can certainly do this with a pipeline that has a single
element. Here's what that looks like in python:

p | beam.Create(['gs://myfile.json']) | beam.ParDo(LoadEachFile()) |
WriteToMyDatabase()

If, on the other hand, you may have a PCollection with multiple elements
(e.g. filenames), and you want to process them one-by-one, you can group
them all on a single key, like so:

my_filename_pcoll | GroupIntoSingleThread() | beam.ParDo(LoadEachFile()) |
WriteToMyDatabase()

Where the GroupIntoSingleThread transform looks basically like so:

input | beam.Map(lambda x: ('singlekey', x)) | beam.GroupByKey() |
beam.FlatMap(lambda x: x[1])

In this example, we are adding a single key to all elements, grouping them
all together, and then throwing away the key, to get each of the elements
one-by-one in a single thread. You can do something similar using side
inputs (with AsIter(my_filename_pcoll)).

Does that help? Or perhaps you could clarify a bit more about your use case.
Best
-P.

On Mon, Jan 7, 2019 at 1:33 PM Matt Casters  wrote:

> Hi Beam!
>
> There's a bunch of stuff that I would like to support and it's probably
> something silly but I couldn't find it immediately ... or I'm completely
> dim and making too much of certain things.
>
> The thing is, sometimes you just want to do a single threaded operations.
> For example, we sometimes need to read generic JSON or XML documents or
> perform single threaded bulk loads into certain databases.
> There's also simple relational database data you might want to side-load
> or data from some web service somewhere.
>
> So, how can I instruct Beam not to fire up a bunch of readers or writers,
> what is a good alternative for ParDo?
>
> Thanks in advance for any suggestions!
>
> Matt
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
>


Re: Single threaded processing

2019-01-07 Thread Pablo Estrada
Hi Matt,

I am much more familiar with Python, so I usually answer questions using
that SDK. Also, it's quicker to type a fully detailed pipeline on an email
and the SDKs are similar enough that it should not be too difficult to
translate to Java from an IDE.

To your questions:
1. Grouping like that should not create a single in-memory iterable.
Runners provide lazy iterables that load elements as you go through them.

2. The Java SDK provides the FileIO.match(), and FileIO.readMatches()
transforms, which generate a list of file metadata objects, and convert
them into file descriptors that can be read in a ParDo (respectively). I
think those should do the trick for you.

I guess, in Java you'd do something like so:

p.apply(FileIO.match().filepattern(...))
  .apply(ParDo.of(new AddSingleKeyDoFn()))
  .apply(GroupByKey.create())
  .apply(ParDo.of(new DropKeyDoFn()))
  .apply(FileIO.readMatches())
  .apply(ParDo.of(new ConsumeFileDescriptors()));

You'll have to follow the FileIO documentation
<https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/FileIO.html>[1]
to see what the file descriptors look like, and how to use the match
function.

Best
-P.

[1]
https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/FileIO.html

On Mon, Jan 7, 2019 at 2:41 PM Matt Casters  wrote:

> Hi Pablo,
>
> Apologies, I thought the cases were very simple and clear.  Obviously I
> should have also mentioned I'm in Java land, not used to the script kiddy
> stuff :-)
>
> On the output side: thanks for the grouping "trick".  However, doesn't
> that mean that all rows will end up in a single in-memory Iterable?
>
> On the input side, reading a JSON or XML file shouldn't be black magic but
> apparently it's completely undocumented as far as I could tell.  Any tips
> there?
>
> Mind you, if it's not possible in the Beam API, feel free to let me know
> and I'll dive into the Beam source code to figure something out.
>
> Cheers,
>
> Matt
>
>
>
> ---
> Matt Casters <mattcast...@gmail.com>
> Senior Solution Architect, Kettle Project Founder
>
>
>
> Op ma 7 jan. 2019 om 23:09 schreef Pablo Estrada :
>
>> Hi Matt,
>> is this computation running as part of a larger pipeline that does run
>> some parallel processing? Otherwise, it's odd that it needs to run on Beam.
>> Nonetheless, you can certainly do this with a pipeline that has a single
>> element. Here's what that looks like in python:
>>
>> p | beam.Create(['gs://myfile.json']) | beam.ParDo(LoadEachFile()) |
>> WriteToMyDatabase()
>>
>> If, on the other hand, you may have a PCollection with multiple elements
>> (e.g. filenames), and you want to process them one-by-one, you can group
>> them all on a single key, like so:
>>
>> my_filename_pcolll | GroupIntoSingleThread() | beam.PArDo(LoadEachFile())
>> | WriteToMyDatabase()
>>
>> Where the GroupIntoSingleThread transform looks basically like so:
>>
>> input | beam.Map(lambda x: ('singlekey', x)) | beam.GroupByKey() |
>> beam.FlatMap(lambda x: x[1])
>>
>> In this example, we are adding a single key to all elements, grouping
>> them all together, and then throwing away the key, to get each of the
>> elements one-by-one in a single thread. You can do something similar using
>> side inputs (with AsIter(my_filename_pcoll)).
>>
>> Does that help? Or perhaps you could clarify a bit more about your use
>> case.
>> Best
>> -P.
>>
>> On Mon, Jan 7, 2019 at 1:33 PM Matt Casters 
>> wrote:
>>
>>> Hi Beam!
>>>
>>> There's a bunch of stuff that I would like to support and it's probably
>>> something silly but I couldn't find it immediately ... or I'm completely
>>> dim and making too much of certain things.
>>>
>>> The thing is, sometimes you just want to do a single threaded operations.
>>> For example, we sometimes need to read generic JSON or XML documents or
>>> perform single threaded bulk loads into certain databases.
>>> There's also simple relational database data you might want to side-load
>>> or data from some web service somewhere.
>>>
>>> So, how can I instruct Beam not to fire up a bunch of readers or
>>> writers, what is a good alternative for ParDo?
>>>
>>> Thanks in advance for any suggestions!
>>>
>>> Matt
>>> ---
>>> Matt Casters <mattcast...@gmail.com>
>>> Senior Solution Architect, Kettle Project Founder
>>>
>>>
>>>


Re: Single threaded processing

2019-01-07 Thread Pablo Estrada
I've just learned that there are these transforms that should be useful:

p.apply(FileIO.match().filepattern(...))
  .apply(WithKeys.of((Void) null))
  .apply(GroupByKey.create())
  .apply(Values.create())
  .apply(Flatten.iterables())
  .apply(FileIO.readMatches())
  .apply(ParDo.of(new ConsumeFileDescriptors()));

-P.

On Mon, Jan 7, 2019 at 5:23 PM Pablo Estrada  wrote:

> Hi Matt,
>
> I am much more familiar with Python, so I usually answer questions using
> that SDK. Also, it's quicker to type a fully detailed pipeline on an email
> and the SDKs are similar enough that it should not be too difficult to
> translate to Java from an IDE.
>
> To your questions:
> 1. Grouping like that should not create a single in-memory iterable.
> Runners provide lazy iterables that load elements as you go through them.
>
> 2. The Java SDK provides the FileIO.match(), and FileIO.readMatches()
> transforms, which generate a list of file metadata objects, and convert
> them into file descriptors that can be read in a ParDo (respectively). I
> think those should do the trick for you.
>
> I guess, in Java you'd do something like so:
>
> p.apply(FileIO.match().filepattern(...))
>   .apply(ParDo.of(new AddSingleKeyDoFn())
>   .apply(GroupByKey.create())
>   .apply(ParDo.of(new DropKeyDoFn())
>   .apply(FileIO.readMatches())
>   .apply(ParDo.of(new ConsumeFileDescriptors());
>
> You'll have to follow the FileIO documentation
> <https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/FileIO.html>[1]
> to see what the file descriptors look like, and how to use the match
> function.
>
> Best
> -P.
>
> [1]
> https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/FileIO.html
>
> On Mon, Jan 7, 2019 at 2:41 PM Matt Casters  wrote:
>
>> Hi Pablo,
>>
>> Apologies, I thought the cases were very simple and clear.  Obviously I
>> should have also mentioned I'm in Java land, not used to the script kiddy
>> stuff :-)
>>
>> On the output side: thanks for the grouping "trick".  However, doesn't
>> that mean that all rows will end up in a single in-memory Iterable?
>>
>> On the input side, reading a JSON or XML file shouldn't be black magic
>> but apparently it's completely undocumented as far as I could tell.  Any
>> tips there?
>>
>> Mind you, if it's not possible in the Beam API, feel free to let me know
>> and I'll dive into the Beam source code to figure something out.
>>
>> Cheers,
>>
>> Matt
>>
>>
>>
>> ---
>> Matt Casters <mattcast...@gmail.com>
>> Senior Solution Architect, Kettle Project Founder
>>
>>
>>
>> Op ma 7 jan. 2019 om 23:09 schreef Pablo Estrada :
>>
>>> Hi Matt,
>>> is this computation running as part of a larger pipeline that does run
>>> some parallel processing? Otherwise, it's odd that it needs to run on Beam.
>>> Nonetheless, you can certainly do this with a pipeline that has a single
>>> element. Here's what that looks like in python:
>>>
>>> p | beam.Create(['gs://myfile.json']) | beam.ParDo(LoadEachFile()) |
>>> WriteToMyDatabase()
>>>
>>> If, on the other hand, you may have a PCollection with multiple elements
>>> (e.g. filenames), and you want to process them one-by-one, you can group
>>> them all on a single key, like so:
>>>
>>> my_filename_pcolll | GroupIntoSingleThread() |
>>> beam.PArDo(LoadEachFile()) | WriteToMyDatabase()
>>>
>>> Where the GroupIntoSingleThread transform looks basically like so:
>>>
>>> input | beam.Map(lambda x: ('singlekey', x)) | beam.GroupByKey() |
>>> beam.FlatMap(lambda x: x[1])
>>>
>>> In this example, we are adding a single key to all elements, grouping
>>> them all together, and then throwing away the key, to get each of the
>>> elements one-by-one in a single thread. You can do something similar using
>>> side inputs (with AsIter(my_filename_pcoll)).
>>>
>>> Does that help? Or perhaps you could clarify a bit more about your use
>>> case.
>>> Best
>>> -P.
>>>
>>> On Mon, Jan 7, 2019 at 1:33 PM Matt Casters 
>>> wrote:
>>>
>>>> Hi Beam!
>>>>
>>>> There's a bunch of stuff that I would like to support and it's probably
>>>> something silly but I couldn't find it immediately ... or I'm completely
>>>> dim and making too much of certain things.
>>>>
>>>> The thing is, sometimes you just want to do a single threaded
>>>> operations.
>>>> For example, we sometimes need to read generic JSON or XML documents or
>>>> perform single threaded bulk loads into certain databases.
>>>> There's also simple relational database data you might want to
>>>> side-load or data from some web service somewhere.
>>>>
>>>> So, how can I instruct Beam not to fire up a bunch of readers or
>>>> writers, what is a good alternative for ParDo?
>>>>
>>>> Thanks in advance for any suggestions!
>>>>
>>>> Matt
>>>> ---
>>>> Matt Casters <mattcast...@gmail.com>
>>>> Senior Solution Architect, Kettle Project Founder
>>>>
>>>>
>>>>


[Meetup] Apache Flink+Beam+others in Seattle. Feb 21.

2019-02-15 Thread Pablo Estrada
Hello everyone,
There is an upcoming meetup happening in the Google Seattle office, on
February 21st, starting at 5:30pm:
https://www.meetup.com/seattle-apache-flink/events/258723322/

People will be chatting about Beam, Flink, Hive, and AthenaX
. Anyone who is interested, please feel
free to join : )

FWIW, I am not organizing it, but didn't see it advertised on the lists.
I'm bringing it here so people know.
Best
-P.


Re: Beam Summit Europe 2019: CfP

2019-03-04 Thread Pablo Estrada
Thanks to everyone involved organizing this. This is exciting : )
Best
-P.

On Mon, Mar 4, 2019 at 1:27 PM Matthias Baetens 
wrote:

> Hi everyone,
>
> As you might already know, the *Beam Summit Europe 2019* will take place
> in *Berlin* this year on *19-20 June*!
>
> Of course, we would love to have you there. That is why we are opening the 
> *Call
> for Speakers .*
>
> We are looking for people to share use-cases for Beam, do a technical deep
> dive or deliver a workshop on Beam. We have a few standard slots, but don't
> hesitate to do a proposal in case you have something different in mind.
>
> Stay tuned for more announcements - we are working hard on the website,
> which should be up and running soon. And of course, don't hesitate to reach
> out if you have any questions or suggestions!
>
> Looking forward to seeing you in person later this year.
>
> Best regards,
> The Beam Summit organising team
>


Re: GSOC - Apache Beam Python SDK

2019-03-12 Thread Pablo Estrada
Hi Pasan!
Welcome to Apache Beam. Happy to have your interest. Can you share what are
your specific questions about the topic? My initial advice would be to
study the filesystems[1] packages of Beam, and the GCS filesystem[2].
As a piece of advice, you can find us in the ASF slack:
https://s.apache.org/beam-slack-channel, and ask questions there directly :
)
Best
-P.

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py

[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsfilesystem.py
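
To make that starting point a bit more concrete: a new filesystem is essentially a
subclass of the abstract FileSystem class, registered under a new scheme. A very
rough skeleton (the method list here is abridged and based on the interface as of
roughly this time; check filesystem.py and gcsfilesystem.py for the authoritative
signatures):

from apache_beam.io.filesystem import CompressionTypes, FileSystem


class S3FileSystem(FileSystem):
    """Sketch of a filesystem for paths of the form s3://bucket/key."""

    @classmethod
    def scheme(cls):
        return 's3'

    def join(self, basepath, *paths):
        return '/'.join([basepath.rstrip('/')] + list(paths))

    def split(self, path):
        head, _, tail = path.rpartition('/')
        return head, tail

    def create(self, path, mime_type='application/octet-stream',
               compression_type=CompressionTypes.AUTO):
        raise NotImplementedError  # should return a writable file-like object

    def open(self, path, mime_type='application/octet-stream',
             compression_type=CompressionTypes.AUTO):
        raise NotImplementedError  # should return a readable file-like object

    # ...plus _list, copy, rename, delete, exists, size, last_updated,
    # checksum, mkdirs and has_dirs, mirroring gcsfilesystem.py.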


On Tue, Mar 12, 2019 at 6:00 AM Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:

> I am an undergraduate from Sri Lanka Institute of Information Technology
> and I am in my 3rd year. So I would like to do an internship based on your
> organization and I am highly interested in Apache Beam project. So I want
> to know how to start my career in this project. I have gone through the
> documentation and I have some point of clarification about the project.
> So Please can you guys help me to engage in implement an S3 filesystem for
> python sdk
>
> Thank you
> Happy to join with large community.
>


Re: GSOC - Apache Beam Python SDK

2019-03-12 Thread Pablo Estrada
Oh, if you are not yet subscribed to the ASF slack, you can do so here:
https://s.apache.org/slack-invite

On Tue, Mar 12, 2019 at 10:30 AM Pablo Estrada  wrote:

> Hi Pasan!
> Welcome to Apache Beam. Happy to have your interest. Can you share what
> are your specific questions about the topic? My initial advice would be to
> study the filesystems[1] packages of Beam, and the GCS filesystem[2].
> As a piece of advice, you can find us in the ASF slack:
> https://s.apache.org/beam-slack-channel, and ask questions there directly
> : )
> Best
> -P.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py
> [1]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py
>
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsfilesystem.py
>
>
>
> On Tue, Mar 12, 2019 at 6:00 AM Pasan Kamburugamuwa <
> pasankamburugamu...@gmail.com> wrote:
>
>> I am an undergraduate from Sri Lanka Institute of Information Technology
>> and I am in my 3rd year. So I would like to do an internship based on your
>> organization and I am highly interested in Apache Beam project. So I want
>> to know how to start my career in this project. I have gone through the
>> documentation and I have some point of clarification about the project.
>> So Please can you guys help me to engage in implement an S3 filesystem
>> for python sdk
>>
>> Thank you
>> Happy to join with large community.
>>
>


Re: GSOC - Implement an S3 filesystem for Python SDK

2019-03-13 Thread Pablo Estrada
Hi Pasan!
I answered with some links an tips in your previous post. You can find them
here:
https://lists.apache.org/thread.html/c6637178a0fa5e4f0b2f1b3fe8991b79863f384d2573b6f22cb5f3b2@%3Cuser.beam.apache.org%3E

Best
-P.

On Tue, Mar 12, 2019 at 8:26 PM Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:

> Hi ,
>
> I am a 3rd year software Engineering undergraduate at Sri Lanka Institute
> of Information Technology(SLIIT), Sri Lanka. I am interested in this
> project for GSOC 2019. I have gone through the document and I would like to
> deep diving into the codebase. So can you please point me to any relevant
> issues so I can get more familiar with the codebase(I am really interested
> in this project)
>
>
>
>
>
> Thanks and best reagards
>
> Pasan kamburugamuwa
>
>
>
>
>
>
>


PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Pablo Estrada
Hi all,
The DirectRunner does not seem to support RuntimeValueProvider. Is there a
suggestion for DirectRunner pipelines to access arguments passed in as
pipeline options(but not necessarily passed explicitly by users) at
pipeline execution time?

Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
possibility, but it seems like that's not ideal. Any other suggestions?

Should we simply support RuntimeValueProvider in direct runner?
Best
-P.


Re: PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Pablo Estrada
Thanks Ahmet! These are illustrative explanations.

I still wonder about one question:

>
>> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
>> possiblity, but it seems like that's not ideal. Any other suggestions?
>>
> Is this an appropriate way of obtaining an option that is not explicitly
passed by the user? It prints a warning.

"What's the context?" - I'm working on a transform that writes to big
query, and table destinations can come in the form "dataset.table", or
"project:dataset.table". Because these are parsed at runtime (destinations
are dynamic), the ptransform checks if a project was provided in
pipelineoptions via value providers.

Thanks!
-P.
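
For what it's worth, the pcoll.pipeline.options route mentioned above looks roughly
like the sketch below. GoogleCloudOptions is the standard options view that carries
the project; the transform and helper names are made up:

import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions


class QualifyTableSpecs(beam.PTransform):
    """Fills in a default project for 'dataset.table' style destinations."""

    def expand(self, pcoll):
        # Options attached to the pipeline are visible at construction time.
        gcp_options = pcoll.pipeline.options.view_as(GoogleCloudOptions)
        default_project = gcp_options.project

        def qualify(table_spec):
            # 'dataset.table' -> 'project:dataset.table'
            if ':' in table_spec:
                return table_spec
            return '%s:%s' % (default_project, table_spec)

        return pcoll | beam.Map(qualify)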



>> Should we simply support RuntimeValueProvider in direct runner?
>>
>
> This is a bit tricky for python. Because it is possible to run multiple
> pipeline with DirectRunner in the same process (e.g. call run and do not
> block on results.) RuntimeValueProvider works by setting a global variable
> and in the case of direct runner multiple pipelines could share the same
> process, it gets tricky to support this.
>
>
>> Best
>> -P.
>>
>


Re: Implementation an S3 file system for python SDK

2019-04-03 Thread Pablo Estrada
Hi Pasan!
Thanks for the proposal. I'll try to take a look in the next few hours and
give some feedback.
Best
--P.

On Wed, Apr 3, 2019, 8:53 AM Ahmet Altay  wrote:

> +Pablo Estrada 
>
> On Wed, Apr 3, 2019 at 8:46 AM Lukasz Cwik  wrote:
>
>> +dev 
>>
>> On Wed, Apr 3, 2019 at 2:03 AM Pasan Kamburugamuwa <
>> pasankamburugamu...@gmail.com> wrote:
>>
>>> Hi ,
>>>
>>> I have completed a proposal to implementation an S3 file system for
>>> python SDK for the google summer of Code 2019. Please can you guys review
>>> this proposal and if there is any issues with this proposal, let me know.
>>> Here is the link to the project proposal -
>>>
>>> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>>>
>>> Thank you
>>> Pasan Kamburugamuwa
>>>
>>>


Re: Is there an integration test available for filesystem checking

2019-04-08 Thread Pablo Estrada
I recommend you send these questions to the dev@ list Pasan.

Have you looked at the *_test.py files corresponding to each one of the
file systems? Are they all mocking their access to GCS?
Best
-P.

On Sun, Apr 7, 2019 at 11:12 PM Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:

> Hello,
>
> I am currently updating the project proposal which I have already sent to
> the community to get the feedback. So I am having a problem with it, I want
> to know is there any integration testing available for test the filesystem.
>
> Thanks
> pasan kamburugamuwa
>


Re: Implementation an S3 file system for python SDK - Updated

2019-04-08 Thread Pablo Estrada
Currently, Pasan is working on a design for adding a couple implementations
to the Filesystem interface in Python, and it's not necessary to consider
SDF here. IMHO.

On the other hand, Python's fileio[1] could probably use SDF-based
improvements to split when many files are being matched.
Best
-P.

On Mon, Apr 8, 2019 at 10:00 AM Alex Amato  wrote:

> +Lukasz Cwik , +Boyuan Zhang , +Lara
> Schmidt 
>
> Should splittable DoFn be considered in this design? In order to split and
> scale the source step properly?
>
> On Mon, Apr 8, 2019 at 9:11 AM Ahmet Altay  wrote:
>
>> +dev  +Pablo Estrada  +Chamikara
>> Jayalath  +Udi Meiri 
>>
>> Thank you Pasan. I quickly looked at the proposal and it looks good.
>> Added a few folks who could offer additional feedback.
>>
>> On Mon, Apr 8, 2019 at 12:13 AM Pasan Kamburugamuwa <
>> pasankamburugamu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have updated the project proposal according to the given feedback. So
>>> can you guys check my proposal again and give me your feedback about
>>> corrections I have done.
>>>
>>> Here is the link to the updated project proposal
>>>
>>> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>>>
>>> Thank you
>>> Pasan Kamburugamuwa
>>>
>>


Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-15 Thread Pablo Estrada
Hello Henrique,

I am not aware of existing Beam transforms specifically used for reading in
XLSX data. Can you share what you mean by "examples related with Cs
extension"?

I am aware of some Python libraries for this sort of thing[1]. You could
use the FileIO transforms in the Python SDK to find each file, and then
write a DoFn that is able to read in data from these files. Check out this
unit test using FileIO to read CSV files[2].

Let me know if that helps, or if I went on the wrong direction of what you
needed.
Best
-P.

[1] https://openpyxl.readthedocs.io/en/stable/
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
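
Putting those two pieces together, a rough sketch of an xlsx reader (openpyxl is an
external dependency, the bucket path is made up, and the row-to-dict conversion is
only illustrative):

import io

import apache_beam as beam
import openpyxl
from apache_beam.io import fileio


class ReadXlsxRows(beam.DoFn):
    """Emits one dict per row, per sheet, of each matched .xlsx file."""

    def process(self, readable_file):
        # ReadMatches hands us a ReadableFile; load its bytes into openpyxl.
        workbook = openpyxl.load_workbook(
            io.BytesIO(readable_file.read()), read_only=True)
        for sheet in workbook.worksheets:
            for row in sheet.iter_rows(values_only=True):
                yield {'file': readable_file.metadata.path,
                       'sheet': sheet.title,
                       'values': list(row)}


with beam.Pipeline() as p:
    rows = (p
            | fileio.MatchFiles('gs://my-bucket/spreadsheets/*.xlsx')
            | fileio.ReadMatches()
            | beam.ParDo(ReadXlsxRows()))
    # 'rows' can then be mapped to the BigQuery schema and fed to
    # beam.io.WriteToBigQuery(...).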

On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina 
wrote:

> Hello
>
> I would like to use best practices from Apache Beams to read Xlsx. however
> I found examples only related with Cs extension.
> someone there is sample using ParDo to Collect all columns and sheets from
> Excel xlsx ?
> Afterwards I will put into google Big query.
>
> Thanks & Regards
>
>


Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-16 Thread Pablo Estrada
Hm I am not very familiar with POI, but if its transforms are able to take
in a file descriptor, you should be able to use FileIO.match()[0] to find
your files (local, or in GCS/S3/HDFS); and FileIO.readMatches()[1] to get
file descriptors for these files.

If the POI libraries require the files to be local in your machine, you may
need to use FileSystems.copy[2] to move your files locally, and then
analyze them.

Let me know if those are some useful building blocks for your pipeline,
Best
-P.

[0]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--
[2]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-


On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina 
wrote:

> Hi Pablo ,
> Thanks for your attention,
> I so sorry, my bad written "Cs extension " I did means .csv extension !
> The example like this: load-csv-file-from-google-cloud-storage
> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>
> I was think Using apache POI to read each row from sheet  throwing to next
> ParDo an CellRow rows
> same like that:
> .apply("xlsxToMap", ParDo.of(new DoFn() {.
>
> I don't know if it is more elegant...
>
> If your have some Idea ! let me know . it will be welcome!!
>
>
> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada  wrote:
>
>> Hello Henrique,
>>
>> I am not aware of existing Beam transforms specifically used for reading
>> in XLSX data. Can you share what you mean by "examples related with Cs
>> extension"?
>>
>> I am aware of some Python libraries foir this sort of thing[1]. You could
>> use the FileIO transforms in the Python SDK to find each file, and then
>> write a DoFn that is able to read in data from these files. Check out this
>> unit test using FileIO to read CSV files[2].
>>
>> Let me know if that helps, or if I went on the wrong direction of what
>> you needed.
>> Best
>> -P.
>>
>> [1] https://openpyxl.readthedocs.io/en/stable/
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>
>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
>> henrique.mol...@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I would like to use best practices from Apache Beams to read Xlsx.
>>> however I found examples only related with Cs extension.
>>> someone there is sample using ParDo to Collect all columns and sheets
>>> from Excel xlsx ?
>>> Afterwards I will put into google Big query.
>>>
>>> Thanks & Regards
>>>
>>>
>>


Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Pablo Estrada
An example that I can think of as a feature that Beam could provide to
other runners is SQL. Beam SQL expands into Beam transforms, and it can run
on other runners. Flink and Spark do have SQL support because they've
invested in it, but think of smaller runners e.g. Nemo.

Of course, not all of Beam's features or abstractions work the same way,
but this is one case.
Best
-P.
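
As a concrete illustration of that point, the same SqlTransform pipeline can run on
any portable runner: the SQL is expanded into ordinary Beam transforms by the Java
expansion service, which needs to be reachable when the pipeline is built.
SqlTransform reached the Python SDK in later releases; a small sketch (the Order
schema is made up):

import typing

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform


class Order(typing.NamedTuple):
    item: str
    amount: int


beam.coders.registry.register_coder(Order, beam.coders.RowCoder)

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([Order('pen', 2), Order('book', 1),
                        Order('pen', 5)]).with_output_types(Order)
         | SqlTransform(
             'SELECT item, SUM(amount) AS total FROM PCOLLECTION GROUP BY item')
         | beam.Map(print))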

On Thu, May 2, 2019 at 10:39 AM kant kodali  wrote:

> If people don't want to use it because crucial libraries are written in
> only some language but not available in others, that makes some sense
> otherwise I would think it is biased(which is what happens most of the
> time). A lot of the Language arguments are biased anyways since most of
> them just talk about syntactic sugar all day.
>
> "For many use cases, the cost of retraining data analysts, software
> engineers, data scientists, ... to use a language they are unfamiliar it is
> a much greater cost (not just salaries but delays in project completion)
> then the cost of the hardware that the jobs run on. Once the cost of the
> jobs are significant, paying to optimize it via a different implementation,
> performance tuning, ... becomes worthwhile."
>
> I agree with this and By different implementation, I am assuming you meant
> optimizing the language that "data analysts, software engineers, data
> scientists" are familiar with. On the contrary, I don't understand why
> Google pays a particular group to come up with new languages when there are
> so many languages already available!
>
>
> On Thu, May 2, 2019 at 10:00 AM Lukasz Cwik  wrote:
>
>>
>>
>> On Thu, May 2, 2019 at 6:29 AM Maximilian Michels  wrote:
>>
>>> Couple of comments:
>>>
>>> * Flink transforms
>>>
>>> It wouldn't be hard to add a way to run arbitrary Flink operators
>>> through the Beam API. Like you said, once you go down that road, you
>>> loose the ability to run the pipeline on a different Runner. And that's
>>> precisely one of the selling points of Beam. I'm afraid once you even
>>> allow 1% non-portable pipelines, you have lost it all.
>>>
>>> Now, it would be a different story if we had a runner-agnostic way of
>>> running Flink operators on top of Beam. For a subset of the Flink
>>> transformations that might actually be possible. I'm not sure if it's
>>> feasible for Beam to depend on the Flink API.
>>>
>>> * Pipeline Tuning
>>>
>>> There are less bells and whistlers in the Beam API then there are in
>>> Flink's. I'd consider that a feature. As Robert pointed out, the Runner
>>> can make any optimizations that it wants to do. If you have an idea for
>>> an optimizations we could built it into the FlinkRunner.
>>>
>>> I'd also consider if we could add an easier way for the user to apply a
>>> custom optimization code, apart from forking the FlinkRunner.
>>>
>>> * Lock-In
>>>
>>> > Is it possible that in the near future, most of Beam's capabilities
>>> would favor Google's Dataflow API?
>>>
>>> I think that was true for the predecessor of Beam which was built for
>>> Google Cloud Dataflow, although even then there were different runtimes
>>> within Google, i.e. FlumeJava (batch) and Millwheel (streaming).
>>>
>>> The idea of Beam is to build a framework that works in the open-source
>>> as well as in proprietary Runners. As with any Apache project, there are
>>> different interests within the project. A healthy community will ensure
>>> that the interests are well-balanced. The Apache development model also
>>> has the advantage that small parties cannot be simply overruled.
>>>
>>
>> +1
>>
>>
>>> * Language portability
>>>
>>> > "one of the things Beam has focused on was a language portability
>>> framework"  Sure but how important is this for a typical user? Do people
>>> stop using a particular tool because it is in an X language? I
>>>
>>> It is very important to some people. So important that they wouldn't use
>>> a system which does not offer it. Possible reasons: crucial libraries
>>> only available in Python, users that refuse to use Java.
>>>
>>
>> For many use cases, the cost of retraining data analysts, software
>> engineers, data scientists, ... to use a language they are unfamiliar in is
>> a much greater cost (not just salaries but delays in project completion)
>> then the cost of the hardware that the jobs run on. Once the cost of the
>> jobs are significant, paying to optimize it via a different implementation,
>> performance tuning, ... becomes worthwhile.
>>
>>
>>>
>>> Cheers,
>>> Max
>>>
>>> On 02.05.19 13:44, Robert Bradshaw wrote:
>>> > Correct, there's no out of the box way to do this. As mentioned, this
>>> > would also result in non-portable pipelines. However, even the
>>> > portability framework is set up such that runners can recognize
>>> > particular transforms and provide their own implementations thereof
>>> > (which is how translations are done for ParDo, GroupByKey, etc.) and
>>> > it is encouraged that runners do this for composite operations they
>>> > have can do be

[ANNOUNCE] Apache Beam 2.40.0 Released

2022-06-27 Thread Pablo Estrada
The Apache Beam team is pleased to announce the release of version 2.40.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam blog: https://beam.apache.org/blog/beam-2.40.0/

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.40.0.

-- Pablo, on behalf of The Apache Beam team


Re: [ANNOUNCE] Apache Beam 2.41.0 Released

2022-08-25 Thread Pablo Estrada via user
Thank you Kiley!

On Thu, Aug 25, 2022 at 10:55 AM Kiley Sok  wrote:

> The Apache Beam team is pleased to announce the release of version 2.41.0.
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed on
> the Beam blog: https://beam.apache.org/blog/beam-2.41.0/
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.41.0.
>
> -- Kiley, on behalf of The Apache Beam team
>


Re: Where to specify trust.jks

2023-05-18 Thread Pablo Estrada via user
Hi Utkarsh,
you can pass a path in GCS (or a filesystem), and the workers should be
able to download it onto themselves. You'd pass
`gs://my-bucket-name/path/to/trust.jks`. Can you try that?
Best
-P.
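
Following that suggestion, the consumer config would point the standard Kafka SSL
properties at the GCS path. A sketch with placeholder broker, topic and password
values; whether the workers can resolve the gs:// path directly depends on the
environment, so treat this as untested:

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    messages = (p
                | ReadFromKafka(
                    consumer_config={
                        'bootstrap.servers': 'broker-1:9093',  # placeholder
                        'security.protocol': 'SSL',
                        'ssl.truststore.location':
                            'gs://my-bucket-name/path/to/trust.jks',
                        'ssl.truststore.password': 'changeit',  # placeholder
                    },
                    topics=['my-topic']))  # placeholder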

On Wed, May 10, 2023 at 1:58 PM Utkarsh Parekh 
wrote:

> Hi,
>
> I'm testing a streaming app using Kafka, Dataflow, and Apache Beam
> [Python].
>
>  "Error message from worker: org.apache.beam.sdk.util.UserCodeException:
> java.lang.RuntimeException: org.apache.kafka.common.KafkaException: Failed
> to construct kafka consumer
> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn$DoFnInvoker.invokeSplitRestriction(Unknown
> Source)
> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForSplitRestriction(FnApiDoFnRunner.java:888)
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:313)
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:245)
> org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1789)
> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForPairWithRestriction(FnApiDoFnRunner.java:825)
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:313)
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:245)
> org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1789)
> org.apache.beam.fn.harness.FnApiDoFnRunner.access$3000(FnApiDoFnRunner.java:143)
> org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2357)
> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContextBase.output(FnApiDoFnRunner.java:2527)
> org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:78)
> org.apache.beam.sdk.io.Read$OutputSingleSource.processElement(Read.java:1018)
> org.apache.beam.sdk.io.Read$OutputSingleSource$DoFnInvoker.invokeProcessElement(Unknown
> Source)
> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:800)
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:313)
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:245)
> org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:213)
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver2.multiplexElements(BeamFnDataInboundObserver2.java:158)
> org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:533)
> org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:151)
> org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:116)
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:162)
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> java.base/java.lang.Thread.run(Thread.java:829) Caused by:
> java.lang.RuntimeException: org.apache.kafka.common.KafkaException: Failed
> to construct kafka consumer
> org.apache.beam.sdk.io.kafka.KafkaUnboundedSource.createReader(KafkaUnboundedSource.java:136)
> org.apache.beam.sdk.io.kafka.KafkaUnboundedSource.createReader(KafkaUnboundedSource.java:45)
> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn$UnboundedSourceAsSDFRestrictionTracker.initializeCurrentReader(Read.java:843)
> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn$UnboundedSourceAsSDFRestrictionTracker.getProgress(Read.java:975)
> org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory$DefaultGetSize.invokeGetSize(ByteBuddyDoFnInvokerFactory.java:432)
> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn$DoFnInvoker.invokeGetSize(Unknown
> Source)
> org.apache.beam.fn.harness.FnApiDoFnRunner$SizedRestrictionNonWindowObservingProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:2319)
> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContextBase.output(FnApiDoFnRunner.java:2527)
> org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:78)
> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn.splitRestriction(Read.java:524)
> Caused by: org.apache.ka

Re: 🗽 Join us in NYC at Beam Summit 2023

2023-05-25 Thread Pablo Estrada via user
let's goo

On Thu, May 25, 2023 at 12:49 PM Carolina Escobar 
wrote:

> *Get to know our speakers!*
>
> *Take a quick peek at our program:*
>
>
>-
>
>*Beam IO: CDAP and SparkReceiver IO Connectors Overview *
>Alex Kosolapov and  Elizaveta Lomteva give an overview of a Beam IO
>development process and practical insights gained from experience
>developing CDAP IO and SparkReceiverIO.
>-
>
>*Beam in Nokia NWDAF Distributed Architecture*
>Ifat Afek and Sigalit Aliazov present the analytics requirements of 5G
>stand-alone core, focusing on the NWDAF use cases.
>-
>
>*From Dataflow Templates to Beam: Chartboost’s Journey*
>Austin Bennett & Ferran Fernandez share some of their experiences
>around performance benchmarks, which were done to quantify the potential
>tradeoffs between the developer and computational efficiency.
>-
>
>*Easy cross-language with SchemaTransforms: use your favorite Java
>transform in Python SDK*
>Ahmed Abualsaud explains how to take a Java transform, develop a
>corresponding SchemaTransform for it, then discover and use it in a Python
>SDK pipeline.
>-
>
>*Mapping Data to FHIR with Apache Beam*
>Alex Fragotsis explains how he developed a reusable self-serve system
>based on Apache Beam to make it easier for other teams to develop and
>deploy their own mappers easily using Python UDFs
>-
>
>*Managing dependencies of Python pipelines*
>Valentyn Tymofieiev discusses approaches to configure software
>dependencies of Apache Beam Python pipelines, how to avoid rough edges
>during pipeline submission, and how to debug issues pertaining to
>dependency management when they occur.
>
>
>
> Check out the Program 
> Register Now 