Re: PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Pablo Estrada
Thanks, Ahmet! These explanations are illuminating.

I still wonder about one question:

>
>> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
>> possibility, but it seems like that's not ideal. Any other suggestions?
>>
> Is this an appropriate way of obtaining an option that is not explicitly
passed by the user? It prints a warning.

"What's the context?" - I'm working on a transform that writes to BigQuery,
and table destinations can come in the form "dataset.table" or
"project:dataset.table". Because these are parsed at runtime (destinations
are dynamic), the PTransform checks whether a project was provided in
PipelineOptions via value providers.
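The destination parsing described above could be sketched as follows. This is a hypothetical helper, not Beam's actual BigQueryIO code: the regex and the function name are illustrative assumptions covering only the two formats mentioned.

```python
import re

# Hypothetical sketch of the parsing described above: a destination may be
# "dataset.table" or "project:dataset.table"; when the project part is
# missing, fall back to a default project taken from the pipeline options.
TABLE_SPEC = re.compile(
    r"^(?:(?P<project>[^:.]+):)?(?P<dataset>[^:.]+)\.(?P<table>[^:.]+)$")

def parse_table_spec(spec, default_project=None):
    match = TABLE_SPEC.match(spec)
    if not match:
        raise ValueError("invalid table spec: %r" % spec)
    project = match.group("project") or default_project
    if project is None:
        raise ValueError("no project in %r and no default project" % spec)
    return project, match.group("dataset"), match.group("table")

print(parse_table_spec("my-project:logs.events"))
# -> ('my-project', 'logs', 'events')
print(parse_table_spec("logs.events", default_project="fallback-project"))
# -> ('fallback-project', 'logs', 'events')
```

Real-world table references can be more varied than these two forms, which is part of why the actual parsing (and the question of where the default project comes from) is more involved.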

Thanks!
-P.



>> Should we simply support RuntimeValueProvider in direct runner?
>>
>
> This is a bit tricky for Python, because it is possible to run multiple
> pipelines with DirectRunner in the same process (e.g. call run and do not
> block on the results). RuntimeValueProvider works by setting a global
> variable, and since multiple DirectRunner pipelines could share the same
> process, it gets tricky to support this.
>
>
>> Best
>> -P.
>>
>


Re: PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Ahmet Altay
On Thu, Mar 21, 2019 at 4:20 PM Pablo Estrada  wrote:

> Hi all,
> The DirectRunner does not seem to support RuntimeValueProvider. Is there a
> suggestion for DirectRunner pipelines to access arguments passed in as
> pipeline options (but not necessarily passed explicitly by users) at
> pipeline execution time?
>

RuntimeValueProviders are useful when some pipeline options are not set at
pipeline construction time but are available at execution time. In the case
of DirectRunner, what is available at execution time is the same as what is
available at construction time. Am I missing something? Why do we need
RuntimeValueProvider support in DirectRunner?


>
> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
> possibility, but it seems like that's not ideal. Any other suggestions?
>

> Should we simply support RuntimeValueProvider in direct runner?
>

This is a bit tricky for Python, because it is possible to run multiple
pipelines with DirectRunner in the same process (e.g. call run and do not
block on the results). RuntimeValueProvider works by setting a global
variable, and since multiple DirectRunner pipelines could share the same
process, it gets tricky to support this.
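The hazard described above can be shown with a toy model. All names below are hypothetical; this is not Beam's actual RuntimeValueProvider implementation, only an illustration of why a per-process global breaks when two pipelines share one process:

```python
# Toy model: runtime values resolve through state that is global to the
# process, so two concurrently running pipelines overwrite each other.
class RuntimeValue:
    _global_options = {}  # one dict shared by every pipeline in the process

    def __init__(self, key):
        self.key = key

    def get(self):
        return RuntimeValue._global_options[self.key]

# Pipeline A starts (run() without blocking) and installs its options.
RuntimeValue._global_options = {"project": "project-a"}
value_for_a = RuntimeValue("project")

# Pipeline B starts in the same process before A has finished...
RuntimeValue._global_options = {"project": "project-b"}

# ...so pipeline A now silently reads pipeline B's option.
print(value_for_a.get())  # -> project-b
```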


> Best
> -P.
>


PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Pablo Estrada
Hi all,
The DirectRunner does not seem to support RuntimeValueProvider. Is there a
suggestion for DirectRunner pipelines to access arguments passed in as
pipeline options (but not necessarily passed explicitly by users) at
pipeline execution time?

Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
possibility, but it seems like that's not ideal. Any other suggestions?
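For reference, the `pcoll.pipeline.options` access pattern looks roughly like this. The classes below are simplified stand-ins, not Beam's real API; they only show the back-reference from a PCollection to the pipeline that owns it.

```python
# Simplified stand-ins for Beam's Pipeline / PCollection / PTransform,
# showing how expand() can reach pipeline options through the PCollection.
class Pipeline:
    def __init__(self, options):
        self.options = options

class PCollection:
    def __init__(self, pipeline):
        self.pipeline = pipeline

class WriteTransform:
    def expand(self, pcoll):
        # The pattern under discussion: reach back through the PCollection
        # to the pipeline that owns it, then read an option.
        project = pcoll.pipeline.options.get("project")
        return "writing to project %s" % project

pipeline = Pipeline(options={"project": "my-project"})
result = WriteTransform().expand(PCollection(pipeline))
print(result)  # -> writing to project my-project
```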

Should we simply support RuntimeValueProvider in direct runner?
Best
-P.


Re: [spark runner dataset POC] wordCount works!

2019-03-21 Thread Kenneth Knowles
Nice milestone!

On Thu, Mar 21, 2019 at 10:49 AM Pablo Estrada  wrote:

> This is pretty cool. Thanks for working on this and for sharing:)
> Best
> -P.
>
> On Thu, Mar 21, 2019, 8:18 AM Alexey Romanenko 
> wrote:
>
>> Good job! =)
>> Congrats to all who were involved in moving this forward!
>>
>> Btw, for anyone interested in the progress of work on this runner, I
>> wanted to remind you that we have a #beam-spark channel on Slack where we
>> discuss all ongoing questions. Feel free to join!
>>
>> Alexey
>>
>> > On 21 Mar 2019, at 15:51, Jean-Baptiste Onofré  wrote:
>> >
>> > Congrats and huge thanks !
>> >
>> > (I'm glad to be one of the little "launcher" to this effort ;) )
>> >
>> > Regards
>> > JB
>> >
>> > On 21/03/2019 15:47, Ismaël Mejía wrote:
>> >> This is excellent news. Congrats Etienne, Alexey and the others
>> >> involved for the great work!
>> >> On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot 
>> wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> We are glad to announce that the spark runner POC that was re-written
>> from scratch using the structured-streaming framework and the dataset API
>> can now run WordCount !
>> >>>
>> >>> It is still embryonic. For now it only runs in batch mode and there
>> is no fancy stuff like state, timer, SDF, metrics, ... but it is still a
>> major step forward !
>> >>>
>> >>> Streaming support work has just started.
>> >>>
>> >>> You can find the branch here:
>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>> >>>
>> >>> Enjoy,
>> >>>
>> >>> Etienne
>> >>>
>> >>>
>>
>>


Re: [spark runner dataset POC] wordCount works!

2019-03-21 Thread Pablo Estrada
This is pretty cool. Thanks for working on this and for sharing:)
Best
-P.

On Thu, Mar 21, 2019, 8:18 AM Alexey Romanenko 
wrote:

> Good job! =)
> Congrats to all who were involved in moving this forward!
>
> Btw, for anyone interested in the progress of work on this runner, I
> wanted to remind you that we have a #beam-spark channel on Slack where we
> discuss all ongoing questions. Feel free to join!
>
> Alexey
>
> > On 21 Mar 2019, at 15:51, Jean-Baptiste Onofré  wrote:
> >
> > Congrats and huge thanks !
> >
> > (I'm glad to be one of the little "launcher" to this effort ;) )
> >
> > Regards
> > JB
> >
> > On 21/03/2019 15:47, Ismaël Mejía wrote:
> >> This is excellent news. Congrats Etienne, Alexey and the others
> >> involved for the great work!
> >> On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot 
> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> We are glad to announce that the spark runner POC that was re-written
> from scratch using the structured-streaming framework and the dataset API
> can now run WordCount !
> >>>
> >>> It is still embryonic. For now it only runs in batch mode and there is
> no fancy stuff like state, timer, SDF, metrics, ... but it is still a major
> step forward !
> >>>
> >>> Streaming support work has just started.
> >>>
> >>> You can find the branch here:
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >>>
> >>> Enjoy,
> >>>
> >>> Etienne
> >>>
> >>>
>
>


Re: joda-time dependency version

2019-03-21 Thread Kenneth Knowles
+dev@

I don't know of any special reason we are using an old version.

Kenn

On Thu, Mar 21, 2019, 09:38 Ismaël Mejía  wrote:

> Does anyone have any context on why we have such an old version of
> Joda-Time (2.4, released in 2014!) and whether there is any possible issue
> upgrading it? If not, maybe we can try to upgrade it.
>
> On Thu, Mar 21, 2019 at 5:35 PM Ismaël Mejía  wrote:
> >
> > Mmmm, interesting issue. There is also a plan to use a vendored version
> > of joda-time; not sure about the progress on that one.
> > https://issues.apache.org/jira/browse/BEAM-5827
> >
> > For Beam 3 that's the idea, but so far there is no ETA for Beam 3.
> > https://issues.apache.org/jira/browse/BEAM-5530
> >
> > On Thu, Mar 21, 2019 at 4:15 PM rahul patwari
> >  wrote:
> > >
> > > Hi David,
> > >
> > > The only incompatibility we have come across is this:
> > > We have some timestamp format conversions in our project, where we are
> converting from a timestamp format to another.
> > >
> > > With joda-time 2.4:
> > > If we convert "2019-03-15 13:56:12", which is in "yyyy-MM-dd HH:mm:ss"
> format, to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12
> 19-Mar-15 -07:00".
> > >
> > > Whereas with joda-time 2.9.3:
> > > If we convert "2019-03-15 13:56:12", which is in "yyyy-MM-dd HH:mm:ss"
> format, to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12
> 19-Mar-15 PDT".
> > >
> > > The javadoc for 'z' in DateTimeFormat doesn't seem to differ between
> the two versions, though.
> > >
> > > Even though the javadoc of both versions says "Zone names: Time zone
> names ('z') cannot be parsed", we are able to parse them in joda-time
> 2.9.3.
> > >
> > > Also, joda-time will be replaced with java time with Beam 3?
> > >
> > > Thanks,
> > > Rahul
> > >
> > > On Thu, Mar 21, 2019 at 5:37 PM David Morávek 
> wrote:
> > >>
> > >> Hello Rahul, are there any incompatibilities you are running into
> with Spark's version? These versions should be backward compatible.
> > >>
> > >> From the joda-time docs:
> > >> "The main public API will remain backwards compatible for both source
> and binary in the 2.x stream."
> > >>
> > >> This means you should be able to safely use Spark's version.
> > >>
> > >> D.
> > >>
> > >> On Thu, Mar 21, 2019 at 5:45 AM rahul patwari <
> rahulpatwari8...@gmail.com> wrote:
> > >>>
> > >>> Hi Ismael,
> > >>>
> > >>> We are using Beam with Spark Runner and Spark 2.4 has joda-time
> 2.9.3 as a dependency. So, we have used joda-time 2.9.3 in our shaded
> artifact set. As Beam has joda-time 2.4 as a dependency, I was wondering
> whether it would break anything in Beam.
> > >>>
> > >>> Will joda-time be replaced with java time in Beam 3? What is the
> expected release date of Beam 3?
> > >>>
> > >>> Thanks,
> > >>> Rahul
> > >>>
> > >>> On Wed, Mar 20, 2019 at 7:23 PM Ismaël Mejía 
> wrote:
> > 
> >  Hello,
> > 
> >  The long term goal would be to get rid of joda-time but that won't
> >  happen until Beam 3.
> >  Any 'particular' reason or motivation to push the upgrade?
> > 
> >  Regards,
> >  Ismaël
> > 
> >  On Wed, Mar 20, 2019 at 11:53 AM rahul patwari
> >   wrote:
> >  >
> >  > Hi,
> >  >
> >  > Is there a plan to upgrade the dependency version of joda-time to
> 2.9.3 or latest version?
> >  >
> >  >
> >  > Thanks,
> >  > Rahul
>
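joda-time is a Java library, so the behavior difference above cannot be reproduced directly here, but the underlying distinction Rahul observes, rendering a time zone as a numeric offset versus as a zone name, has a rough Python analogue in strftime. This is only an illustration of the concept, not of joda-time itself:

```python
from datetime import datetime, timezone

# Loose analogy of the distinction discussed above: a numeric UTC offset
# versus a time zone *name*. Per the report, joda-time 2.4's 'z' produced
# an offset ("-07:00") while 2.9.3's 'z' produces a name ("PDT").
dt = datetime(2019, 3, 15, 13, 56, 12, tzinfo=timezone.utc)
print(dt.strftime("%z"))  # -> +0000 (numeric offset)
print(dt.strftime("%Z"))  # -> UTC (zone name)
```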


Re: [BEAM-6862] Adding pyhamcrest library to python container

2019-03-21 Thread Ahmet Altay
This is fine, thank you for sending a note.

In the future, when we are ready to make container releases along with Beam
releases, we can have a cleaned up version of this container.

On Thu, Mar 21, 2019 at 9:36 AM Mikhail Gryzykhin  wrote:

> Hi everyone,
>
> Recently, a test for verifying metrics was added to the Python SDK
> (PR-8038).
>
> This PR causes the beam_PostCommit_Py_ValCont job to fail due to the lack
> of the pyhamcrest library in the Python SDK container.
>
> I have created PR-8107, which adds the relevant library and fixes the
> test. However, I first want to confirm that we are OK with adding a
> library used for testing to the prod container, or whether we want to use
> some other approach for testing.
>
> For background: we already have the tenacity library added this way.
>
> Regards,
> --Mikhail
>
>


Re: Python36/37 not installed on Beam2 and Beam12?

2019-03-21 Thread Mark Liu
Thanks for the efforts! I followed up on INFRA-17335 and on the Slack
thread, and also created https://issues.apache.org/jira/browse/INFRA-18070
to track this issue separately. Hope this helps!

Mark

On Wed, Mar 20, 2019 at 7:43 PM Valentyn Tymofieiev 
wrote:

> I asked them yesterday on Slack and commented on the existing issue
> https://issues.apache.org/jira/browse/INFRA-17335, but didn't receive
> a response. We can try opening another infra ticket. Mark, perhaps you can
> quote/+1 my message from yesterday in their Slack channel :) ?
>
> On Wed, Mar 20, 2019 at 6:23 PM Yifan Zou  wrote:
>
>> You could try to ping them in the slack channel
>> https://the-asf.slack.com/messages/  if it is really urgent.
>>
>> On Wed, Mar 20, 2019 at 5:29 PM Mark Liu  wrote:
>>
>>> Hi,
>>>
>>> I saw occasional py36 tox test failures in beam_PreCommit_Python and
>>> beam_Release_NightlySnapshot, both in cron jobs and in PR-triggered
>>> jobs. The error is simple:
>>>
>>> ERROR: InterpreterNotFound: python3.6
>>>
>>> It turns out those failures only happened on Beam2 and Beam12. From the
>>> console logs of the inventory jobs (beam2 and beam12), I found that the
>>> python3.6 and python3.7 interpreters are missing. This has made
>>> beam_PreCommit_Python_Cron flaky recently and may fail any Python build
>>> that runs on those two nodes.
>>>
>>> The Infra team helped install Python 3 on our Jenkins nodes before, but
>>> they were slow to respond on JIRA. What's the best way to get the Infra
>>> team involved in this problem?
>>>
>>> Thanks,
>>> Mark
>>>
>>


[BEAM-6862] Adding pyhamcrest library to python container

2019-03-21 Thread Mikhail Gryzykhin
Hi everyone,

Recently, a test for verifying metrics was added to the Python SDK
(PR-8038).

This PR causes the beam_PostCommit_Py_ValCont job to fail due to the lack
of the pyhamcrest library in the Python SDK container.

I have created PR-8107, which adds the relevant library and fixes the
test. However, I first want to confirm that we are OK with adding a
library used for testing to the prod container, or whether we want to use
some other approach for testing.

For background: we already have the tenacity library added this way.

Regards,
--Mikhail



Re: KafkaIO Exactly-Once & Flink Runner

2019-03-21 Thread Thomas Weise
Tracked as https://issues.apache.org/jira/browse/BEAM-6879


On Fri, Mar 15, 2019 at 10:13 AM Kenneth Knowles  wrote:

> Yes, the ParDoPayload has to contain most of the information that is on
> DoFnSignature. Everything except the details for feeding the bits to the
> Java DoFn.
>
> Kenn
>
> On Fri, Mar 15, 2019 at 9:59 AM Reuven Lax  wrote:
>
>> I think this attribute needs to be added to the portability protos.
>>
>> On Fri, Mar 15, 2019 at 9:49 AM Thomas Weise  wrote:
>>
>>>
>>>
>>> On Tue, Mar 12, 2019 at 10:13 AM Maximilian Michels 
>>> wrote:
>>>
 > I think that is what Max's PR does. KafkaIO writes the entire list of
 values associated with a key in one transaction. So it depends on how the
 Flink runner bundles after a GBK. I would think all of the buffered
 records would be queued. Here, the key is the shard id.

 We do not change the execution logic in the case of stable input. Elements
 will still be processed key-wise.

>>>
>>> Wouldn't that require the KafkaEOS to support a different processing
>>> mode where the elements are committed with @FinishBundle? The runner could
>>> then align bundles and checkpointing as needed.
>>>
>>> I'm now also curious how @RequiresStableInput is supposed to work with
>>> portable pipelines? The runner is not able to inspect the ParDo, so this
>>> would need to be provided explicitly as part of the executable stage?
>>>
>>>
>>>




Re: [Announcement] New Website for Beam Summits

2019-03-21 Thread Alexey Romanenko
Great initiative, thanks for creating this!

Btw, are there any plans to add information about previous Beam-related
events, especially the London Beam Summit last year?

> On 20 Mar 2019, at 19:30, David Morávek wrote:
> 
> This is great! Thanks for all of the hard work you're putting into this.
> 
> D.
> 
> On Wed, Mar 20, 2019 at 1:38 PM Maximilian Michels wrote:
> Not a bug, it's a feature ;)
> 
> On 20.03.19 07:23, Kenneth Knowles wrote:
> > Very nice. I appreciate the emphasis on coffee [1] [2] [3] though I
> > suspect there may be a rendering bug.
> > 
> > Kenn
> > 
> > [1] https://beamsummit.org/schedule/2019-06-19?sessionId=1
> > [2] https://beamsummit.org/schedule/2019-06-19?sessionId=3
> > [3] https://beamsummit.org/schedule/2019-06-19?sessionId=4
> > 
> > On Tue, Mar 19, 2019 at 4:43 AM Łukasz Gajowy wrote:
> > 
> > Looks great! Thanks for doing this! :)
> > 
> > Łukasz
> > 
> > wt., 19 mar 2019 o 12:30 Maximilian Michels napisał(a):
> > 
> > Great stuff! Looking forward to seeing many Beam folks in Berlin.
> > 
> > In case you want to speak at Beam Summit Europe, the Call for
> > Papers is open until April 1:
> > https://sessionize.com/beam-summit-europe-2019
> > 
> > -Max
> > 
> > On 19.03.19 09:49, Matthias Baetens wrote:
> >  > Awesome Aizhamal! Great work and thanks for your continued
> >  > efforts on this :) Looking forward to the summit.
> >  >
> >  > On Mon, 18 Mar 2019 at 23:17, Aizhamal Nurmamat kyzy wrote:
> >  >
> >  > Hello everybody!
> >  >
> >  > We are thrilled to announce the launch of beamsummit.org,
> >  > dedicated to Beam Summits!
> >  >
> >  > The current version of the website provides information about the
> >  > upcoming Beam Summit in Europe on June 19-20th, 2019. We will
> >  > update it for the upcoming summits in Asia and North America
> >  > accordingly. You can access all necessary information about the
> >  > conference theme, speakers and sessions, the abstract submission
> >  > timeline and the registration process, the conference venues and
> >  > much more that you will find useful until and during the Beam
> >  > Summits 2019.
> >  >
> >  > We are working to make the website easy to use, so that anyone
> >  > who is organizing a Beam event can rely on it. You can find the
> >  > code for it in GitHub.
> >  >
> >  > The pages will be updated on a regular basis, but we also love
> >  > hearing thoughts from our community! Let us know if you have any
> >  > questions, comments or suggestions, and help us improve. Also, if
> >  > you are thinking of organizing a Beam event, please feel free to
> >  > reach out for support, and to use the code in GitHub as well.
> >  >
> >  > We sincerely hope that you like the new Beam Summit website and
> >  > will find it useful for accessing information. Enjoy browsing
> >  > around!
> >  >
> >  > Thanks,
> >  >
> >  > Aizhamal



Re: [spark runner dataset POC] wordCount works!

2019-03-21 Thread Alexey Romanenko
Good job! =)
Congrats to all who were involved in moving this forward!

Btw, for anyone interested in the progress of work on this runner, I wanted
to remind you that we have a #beam-spark channel on Slack where we discuss
all ongoing questions. Feel free to join!

Alexey

> On 21 Mar 2019, at 15:51, Jean-Baptiste Onofré  wrote:
> 
> Congrats and huge thanks !
> 
> (I'm glad to be one of the little "launcher" to this effort ;) )
> 
> Regards
> JB
> 
> On 21/03/2019 15:47, Ismaël Mejía wrote:
>> This is excellent news. Congrats Etienne, Alexey and the others
>> involved for the great work!
>> On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot  
>> wrote:
>>> 
>>> Hi guys,
>>> 
>>> We are glad to announce that the spark runner POC that was re-written from 
>>> scratch using the structured-streaming framework and the dataset API can 
>>> now run WordCount !
>>> 
>>> It is still embryonic. For now it only runs in batch mode and there is no 
>>> fancy stuff like state, timer, SDF, metrics, ... but it is still a major 
>>> step forward !
>>> 
>>> Streaming support work has just started.
>>> 
>>> You can find the branch here: 
>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>> 
>>> Enjoy,
>>> 
>>> Etienne
>>> 
>>> 



Re: [spark runner dataset POC] wordCount works!

2019-03-21 Thread Jean-Baptiste Onofré

Congrats and huge thanks !

(I'm glad to be one of the little "launchers" of this effort ;) )

Regards
JB

On 21/03/2019 15:47, Ismaël Mejía wrote:

This is excellent news. Congrats Etienne, Alexey and the others
involved for the great work!

On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot  wrote:


Hi guys,

We are glad to announce that the spark runner POC that was re-written from 
scratch using the structured-streaming framework and the dataset API can now 
run WordCount !

It is still embryonic. For now it only runs in batch mode and there is no
fancy stuff like state, timer, SDF, metrics, ... but it is still a major
step forward!

Streaming support work has just started.

You can find the branch here: 
https://github.com/apache/beam/tree/spark-runner_structured-streaming

Enjoy,

Etienne




Re: [spark runner dataset POC] wordCount works!

2019-03-21 Thread Ismaël Mejía
This is excellent news. Congrats Etienne, Alexey and the others
involved for the great work!

On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot  wrote:
>
> Hi guys,
>
> We are glad to announce that the spark runner POC that was re-written from 
> scratch using the structured-streaming framework and the dataset API can now 
> run WordCount !
>
> It is still embryonic. For now it only runs in batch mode and there is no 
> fancy stuff like state, timer, SDF, metrics, ... but it is still a major step 
> forward !
>
> Streaming support work has just started.
>
> You can find the branch here: 
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>
> Enjoy,
>
> Etienne
>
>


Re: User state cleanup

2019-03-21 Thread Maximilian Michels
Yes, to be extra clear: user state cleanup works correctly in the
FlinkRunner.

The portable Flink Runner needs a fix to behave the same way. Thanks for
opening the issue, Thomas.

-Max

On 21.03.19 14:25, Thomas Weise wrote:
Created https://issues.apache.org/jira/browse/BEAM-6876 for the portable 
Flink runner issue.



On Wed, Mar 20, 2019 at 11:10 AM Kenneth Knowles wrote:




On Wed, Mar 20, 2019 at 6:23 AM Maximilian Michels wrote:

Hi,

I just realized that user state acquired via StateInternals in the Flink
Runner is not released automatically even when it falls out of the Window
scope. There are ways to work around this, i.e. setting a cleanup timer
that fires when the Window expires.

Do we expect Runners to perform the cleanup? I would think so since users
do not have control over state once the window expires.


Just to be super clear for anyone not digging in the referenced
code: yes, we do. And the code Reuven referenced is utility code
that a runner can use to facilitate this, or the runner can do its
own thing.

Kenn


Thanks,
Max



[spark runner dataset POC] wordCount works!

2019-03-21 Thread Etienne Chauchot
Hi guys,

We are glad to announce that the spark runner POC that was re-written from
scratch using the structured-streaming framework and the dataset API can
now run WordCount!

It is still embryonic. For now it only runs in batch mode and there is no
fancy stuff like state, timer, SDF, metrics, ... but it is still a major
step forward!

Streaming support work has just started.

You can find the branch here:
https://github.com/apache/beam/tree/spark-runner_structured-streaming

Enjoy,

Etienne




Re: User state cleanup

2019-03-21 Thread Thomas Weise
Created https://issues.apache.org/jira/browse/BEAM-6876 for the portable
Flink runner issue.


On Wed, Mar 20, 2019 at 11:10 AM Kenneth Knowles  wrote:

>
>
> On Wed, Mar 20, 2019 at 6:23 AM Maximilian Michels  wrote:
>
>> Hi,
>>
>> I just realized that user state acquired via StateInternals in the Flink
>> Runner is not released automatically even when it falls out of the
>> Window scope. There are ways to work around this, i.e. setting a cleanup
>> timer that fires when the Window expires.
>>
>> Do we expect Runners to perform the cleanup? I would think so since
>> users do not have control over state once the window expires.
>>
>
> Just to be super clear for anyone not digging in the referenced code: yes,
> we do. And the code Reuven referenced is utility code that a runner can use
> to facilitate this, or the runner can do its own thing.
>
> Kenn
>
>
>>
>> Thanks,
>> Max
>>
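The cleanup-timer workaround Max mentions can be sketched runner-agnostically. The toy below is pure Python, not Beam's state and timer API: alongside each piece of keyed, windowed state, a timer is registered at the window's end, and when the watermark passes it the state for that window is dropped.

```python
import heapq

# Toy illustration of window-expiry cleanup via timers. Not Beam's API:
# state is keyed by (key, window_end), and each write registers a timer
# that fires when the watermark reaches the end of the window.
class KeyedStateWithCleanup:
    def __init__(self):
        self.state = {}   # (key, window_end) -> value
        self.timers = []  # min-heap of (fire_time, key, window_end)

    def write(self, key, window_end, value):
        self.state[(key, window_end)] = value
        heapq.heappush(self.timers, (window_end, key, window_end))

    def advance_watermark(self, watermark):
        # Fire every timer at or before the watermark: drop expired state.
        while self.timers and self.timers[0][0] <= watermark:
            _, key, window_end = heapq.heappop(self.timers)
            self.state.pop((key, window_end), None)

store = KeyedStateWithCleanup()
store.write("user-1", window_end=60, value=[1, 2, 3])
store.advance_watermark(59)
print(len(store.state))  # window still open -> 1
store.advance_watermark(60)
print(len(store.state))  # window expired, state cleaned up -> 0
```

Without such a timer (set by the runner or by the user), the state for an expired window would live forever, which is the leak described above.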
>>


PubsubIO and projectId

2019-03-21 Thread Jan Lukavský

Hi,

I have come across an issue using PubsubIO with flink runner. The 
problem is described at [1]. I also created PR for this: [2], but there 
are some doubts described in comment in the JIRA issue. Would someone 
have time to walk through it and/or provide some insights?


Thanks,

 Jan

[1] https://issues.apache.org/jira/browse/BEAM-6873

[2] https://github.com/apache/beam/pull/8105