Re: [spark runner dataset POC] WordCount works!

2019-03-21 Thread Pablo Estrada
This is pretty cool. Thanks for working on this and for sharing:)
Best
-P.

On Thu, Mar 21, 2019, 8:18 AM Alexey Romanenko 
wrote:

> Good job! =)
> Congrats to all who were involved in moving this forward!
>
> Btw, for all who are interested in the progress of work on this runner, I
> wanted to remind you that we have a #beam-spark channel on Slack where we
> discuss all ongoing questions. Feel free to join!
>
> Alexey
>
> > On 21 Mar 2019, at 15:51, Jean-Baptiste Onofré  wrote:
> >
> > Congrats and huge thanks!
> >
> > (I'm glad to be one of the little "launcher" to this effort ;) )
> >
> > Regards
> > JB
> >
> > On 21/03/2019 15:47, Ismaël Mejía wrote:
> >> This is excellent news. Congrats Etienne, Alexey and the others
> >> involved for the great work!
> >> On Thu, Mar 21, 2019 at 3:10 PM Etienne Chauchot 
> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> We are glad to announce that the spark runner POC that was re-written
> from scratch using the structured-streaming framework and the dataset API
> can now run WordCount!
> >>>
> >>> It is still embryonic. For now it only runs in batch mode and there is
> no fancy stuff like state, timers, SDF, metrics, ... but it is still a major
> step forward!
> >>>
> >>> Streaming support work has just started.
> >>>
> >>> You can find the branch here:
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >>>
> >>> Enjoy,
> >>>
> >>> Etienne
> >>>
> >>>
>
>


PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Pablo Estrada
Hi all,
The DirectRunner does not seem to support RuntimeValueProvider. Is there a
suggested way for DirectRunner pipelines to access arguments passed in as
pipeline options (but not necessarily passed explicitly by users) at
pipeline execution time?

Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
possibility, but it seems like that's not ideal. Any other suggestions?
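For concreteness, a minimal sketch of the pattern in question (the transform
and the use of GoogleCloudOptions.project are illustrative, not a specific
recommendation):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import GoogleCloudOptions

    class MyTransform(beam.PTransform):
      def expand(self, pcoll):
        # Reach back into the pipeline that owns this PCollection to read
        # an option the user may not have passed explicitly. This works,
        # but accessing pipeline options this way prints a warning.
        project = pcoll.pipeline.options.view_as(GoogleCloudOptions).project
        return pcoll | beam.Map(lambda x, p=project: (p, x))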

Should we simply support RuntimeValueProvider in direct runner?
Best
-P.


Re: PipelineOptions at execution time from DirectRunner

2019-03-21 Thread Pablo Estrada
Thanks Ahmet! These are illustrative explanations.

I still wonder about one question:

>
>> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
>> possibility, but it seems like that's not ideal. Any other suggestions?
>>
> Is this an appropriate way of obtaining an option that is not explicitly
passed by the user? It prints a warning.

"What's the context?" - I'm working on a transform that writes to big
query, and table destinations can come in the form "dataset.table", or
"project:dataset.table". Because these are parsed at runtime (destinations
are dynamic), the ptransform checks if a project was provided in
pipelineoptions via value providers.
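A simplified sketch of that parsing logic (illustrative only, not the actual
Beam helper; the default project here stands in for the value-provider
lookup):

    def parse_table_spec(spec, default_project=None):
        # Accepts 'project:dataset.table' or 'dataset.table'; when no
        # project is given, fall back to the one from PipelineOptions
        # (passed in directly here for illustration).
        if ':' in spec:
            project, rest = spec.split(':', 1)
        else:
            project, rest = default_project, spec
        dataset, table = rest.split('.', 1)
        return project, dataset, table

    assert parse_table_spec('p:d.t') == ('p', 'd', 't')
    assert parse_table_spec('d.t', default_project='p') == ('p', 'd', 't')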

Thanks!
-P.



>> Should we simply support RuntimeValueProvider in direct runner?
>>
>
> This is a bit tricky for Python, because it is possible to run multiple
> pipelines with DirectRunner in the same process (e.g. call run and do not
> block on results). RuntimeValueProvider works by setting a global variable,
> and since multiple pipelines run with the direct runner could share the
> same process, it gets tricky to support this.
>
>
>> Best
>> -P.
>>
>


Re: [ANNOUNCE] New committer announcement: Mark Liu

2019-03-24 Thread Pablo Estrada
Yeaah  Mark! : ) Congrats : D

On Sun, Mar 24, 2019 at 10:32 PM Yifan Zou  wrote:

> Congratulations Mark!
>
> On Sun, Mar 24, 2019 at 10:25 PM Connell O'Callaghan 
> wrote:
>
>> Well done congratulations Mark!!!
>>
>> On Sun, Mar 24, 2019 at 10:17 PM Robert Burke  wrote:
>>
>>> Congratulations Mark! 🎉
>>>
>>> On Sun, Mar 24, 2019, 10:08 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congratulations, Mark!

 Thanks for your contributions, in particular for your efforts to
 parallelize test execution for Python SDK and increase the speed of Python
 precommit checks.

 On Sun, Mar 24, 2019 at 9:40 PM Kenneth Knowles 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Mark Liu.
>
> Mark has been contributing to Beam since late 2016! He has proposed
> 100+ pull requests. Mark was instrumental in expanding test and
> infrastructure coverage, especially for Python. In consideration of
> Mark's contributions, the Beam PMC trusts Mark with the responsibilities 
> of
> a Beam committer [1].
>
> Thank you, Mark, for your contributions.
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-
> beam-committer
>



Re: Writing bytes to BigQuery with beam

2019-03-25 Thread Pablo Estrada
+Chamikara Jayalath  with the new BigQuery sink,
schema autodetection is supported (it's a very simple thing to have). Do
you think we should not have it?
Best
-P.

On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath 
wrote:

>
>
> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes  wrote:
>
>>
>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev 
>> wrote:
>>
>>> We received feedback on https://issuetracker.google.com/issues/129006689 -
>>> BQ developers say that schema identification is done and they discourage
>>> using schema autodetection in tables using BYTES. In light of this, I
>>> think it may be fair to recommend that Beam users specify BQ schemas as
>>> well when they interact with BQ, and call out that writing binary data to
>>> BQ will likely fail unless a schema is specified. Does that make sense?
>>>
>>
>> Given that schema autodetect does not work for bytes, I think it is indeed
>> a good solution to require users to specify BQ schemas as well when they
>> write to BQ.
>>
>> So new summary:
>> 1. Beam will base64-encode raw bytes before passing them to BQ over the
>> REST API. This will be a change in behavior for Python 2 (for good reasons).
>> 2. When reading data from BQ, all fields of type BYTES will be
>> base64-decoded.
>> 3. Beam will send an API call to BigQuery to get table schema, whenever
>> schema is not supplied, to work around
>> https://issuetracker.google.com/issues/129006689. Beam will require
>> users to specify the schema when writing bytes to BQ.
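Points 1 and 2 amount to a base64 round trip on the SDK side, roughly like
this sketch (illustrative values, not the actual BigQuery IO code):

    import base64

    # Writing: raw bytes are base64-encoded before the REST insert.
    value = b'\x00\x01binary'
    json_row = {'bytes_field': base64.b64encode(value).decode('ascii')}

    # Reading: BYTES fields come back base64-encoded and are decoded.
    assert base64.b64decode(json_row['bytes_field']) == value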
>>
>
> I'm not sure why we reached this conclusion. We (Beam) do not use the BQ
> schema auto-detection feature currently. So why not just send an API
> call to get the schema when users are writing to existing tables? Also,
> even if we decide to support schema auto-detection in the future, we will
> not be able to support it for the BYTES type (due to the restriction by BQ).
>
>
>> Thanks all for your input on this!
>> Juta
>>
>>


Re: Writing bytes to BigQuery with beam

2019-03-26 Thread Pablo Estrada
Sure, we can make users explicitly ask for schema autodetection, instead of
it being the default when no schema is provided. I think that's reasonable.


On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev 
wrote:

> Thanks everyone for your input on this thread. I think there is some
> confusion between not specifying the schema and asking BigQuery to do
> schema autodetection. These are not the same thing; however, in recent
> changes to BQ IO that happened after the 2.11 release, we are forcing
> schema autodetection when a schema is not specified; see [1].
>
> I think we need to revise this ahead of 2.12. It may be better if users
> explicitly opt-in to schema autodetection if they wish. Autodetection is an
> approximation, and in particular, as we figured out in this thread, it does
> not work correctly for BYTES data.
>
> I suspect that if we disable schema autodetection, and/or make previous
> implementation of BQ sink a default option, we will be able to write BYTES
> data to a previously created BQ table without specifying the schema, and
> making a call to BQ to fetch the schema won't be necessary. We'd need to
> verify that.
>

> Another interesting note, as per Juta's analysis
> <https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing>,
> google-cloud-bigquery client does not require additional base64 encoding
> for bytes, so once we migrate to use this client, base64 encoding/decoding
> of Bytes data won't be necessary in Beam.
>
> [1]
> https://github.com/apache/beam/blob/0b71f541e93f3bd69af87ad8a6db46ccb4a01ddc/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L321
> .
> [2]
> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit#bookmark=id.7pfrsz1c8hcj
>
> On Mon, Mar 25, 2019 at 2:26 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Mon, Mar 25, 2019 at 2:16 PM Pablo Estrada  wrote:
>>
>>> +Chamikara Jayalath  with the new BigQuery sink,
>>> schema autodetection is supported (it's a very simple thing to have). Do
>>> you think we should not have it?
>>> Best
>>> -P.
>>>
>>
>> Ah, good to know. But IMO users should be able to write to existing tables
>> without specifying a schema (when CREATE_DISPOSITION is CREATE_NEVER, for
>> example). How do users enable schema auto-detection? Probably this should
>> not be enabled by default, and we should clearly advertise that the bytes
>> type is not supported (or support it with extra information). Just my 2
>> cents.
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> On Mon, Mar 25, 2019 at 11:01 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 2:03 AM Juta Staes  wrote:
>>>>
>>>>>
>>>>> On Mon, 25 Mar 2019 at 06:15, Valentyn Tymofieiev 
>>>>> wrote:
>>>>>
>>>>>> We received feedback on
>>>>>> https://issuetracker.google.com/issues/129006689 - BQ developers say
>>>>>> that schema identification is done and they discourage using schema
>>>>>> autodetection in tables using BYTES. In light of this, I think it may
>>>>>> be fair to recommend that Beam users specify BQ schemas as well when
>>>>>> they interact with BQ, and call out that writing binary data to BQ
>>>>>> will likely fail unless a schema is specified. Does that make sense?
>>>>>>
>>>>>
>>>>> Given that schema autodetect does not work for bytes, I think it is
>>>>> indeed a good solution to require users to specify BQ schemas as well
>>>>> when they write to BQ.
>>>>>
>>>>> So new summary:
>>>>> 1. Beam will base64-encode raw bytes before passing them to BQ over
>>>>> the REST API. This will be a change in behavior for Python 2 (for good
>>>>> reasons).
>>>>> 2. When reading data from BQ, all fields of type BYTES will be
>>>>> base64-decoded.
>>>>> 3. Beam will send an API call to BigQuery to get table schema,
>>>>> whenever schema is not supplied, to work around
>>>>> https://issuetracker.google.com/issues/129006689. Beam will require
>>>>> users to specify the schema when writing bytes to BQ.
>>>>>
>>>>
>>>> I'm not sure why we reached this conclusion. We (Beam) do not use the BQ
>>>> schema auto-detection feature currently. So why not just send an API
>>>> call to get the schema when users are writing to existing tables? Also,
>>>> even if we decide to support schema auto-detection in the future, we will
>>>> not be able to support it for the BYTES type (due to the restriction by BQ).
>>>>
>>>>
>>>>> Thanks all for your input on this!
>>>>> Juta
>>>>>
>>>>>


Re: GSoC Project Proposal

2019-03-29 Thread Pablo Estrada
Hi Tanay,
thanks for submitting the proposal. I'll try to take a look on Tuesday to
give some feedback.
I encourage others to take a look as well : )
Best
-P.

On Thu, Mar 28, 2019 at 11:58 AM Tanay Tummalapalli 
wrote:

> Hi everyone,
>
> I am applying for Google Summer of Code with Apache Beam this year.
> I have chosen BEAM-6611  -
> "A Python Sink for BigQuery with File Loads in Streaming" as my project.
> The draft of my proposal
> 
>  is
> ready. I would love to get feedback and criticism from you so I can submit
> the best possible proposal.
>
> PFA my project proposal.
>
> Thank You
> Tanay Tummalapalli
> LinkedIn  | GitHub
> 
>
>
>  GSoC Project proposal - T. Tanay
> 
>
>
>


Re: Build blocking on

2019-03-31 Thread Pablo Estrada
Hi Michael,
I wrote that test and much of that code. I'm quite sorry about the trouble.
The test should use mocks and not hang when it's missing GCP dependencies.
That sounds like a bug in the test. We can deactivate it while I figure out
what's going wrong.
Best
-P.

On Sat, Mar 30, 2019, 2:55 PM Michael Luckey  wrote:

> After digging a bit deeper, I was able to verify that those tests block
> on authorization to GCP.
>
> It seems that, as I do not have any credentials set, the underlying oauth2
> falls back to some local mode. This seems to start a webserver on port 8080
> and wait there forever. Accessing that port forwards to some Google page,
> but that also fails miserably.
>
> Running
>
> python setup.py nosetests --tests
>>  
>> apache_beam.io.gcp.bigquery_file_loads_test:TestBigQueryFileLoads.test_records_traverse_transform_with_mocks
>
>
> and hitting 'Ctrl-C' after it got stuck, results in following output:
>
> 'KeyboardInterrupt [while running
>> \'WriteToBigQuery/BigQueryBatchFileLoads/RemoveTempTables/Delete\']\n
>> Your browser has been opened to visit:
>>
>> https://accounts.google.com/o/oauth2/v2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery+https%3A%
>> If your browser is on a different machine then exit and re-run this
>> application with the command-line parameter
>>   --noauth_local_webserver
>> Failed to find "code" in the query parameters of the redirect.
>> Invalid authorization: Try running with --noauth_local_webserver.
>
>
> I am a bit lost here on how to proceed.
>
>
> On Tue, Mar 26, 2019 at 11:48 PM Michael Luckey 
> wrote:
>
>>
>>
>> On Tue, Mar 26, 2019 at 11:18 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> I believe what happens is that testPy2Gcp actually runs integration
>>> tests that try to connect to GCP.
>>>
>>
>> Actually, I was hoping for an explanation like this. Any suggestion for
>> how I could confirm that on my end?
>>
>>
>>> Without having GCP cluster and configuration on your machine I'd expect
>>> these tests to fail.
>>>
>>
>> Hmm... here I am actually unsure what would be the best way to handle
>> such cases.
>>
>> If I understand correctly, we currently skip some tests which do not meet
>> expectations, kind of 'cannot run on your arch' thingies... So I am
>> undecided whether I'd prefer those tests to be skipped if GCP
>> configuration is missing (a sketch of such a skip follows after this list).
>>
>> pro
>> * A dev is still able to run the tests (whichever task they are associated
>> with) without having to separate the failures out. For instance, this
>> 'testPy2Gcp' task does actually execute 'some tests' - which might already
>> be covered by some other calls... But I definitely do not like the idea of
>> putting the burden on the developer to track which tasks/tests might be
>> executed on the local machine. Unless this distinction is really coarse -
>> and pre/postcommit is something I really would like to be able to run
>> locally...
>>
>>
>> con
>> * we definitely need to make sure those tests are not accidentally
>> skipped on CI servers.
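A sketch of the skip pattern under discussion, following the common
guarded-import convention for optional GCP dependencies (note this guards
missing dependencies; guarding missing credentials would need an explicit
check):

    import unittest

    # Skip, rather than fail or hang, when the GCP extras are absent.
    try:
      from apache_beam.io.gcp import bigquery_tools  # noqa: F401
    except ImportError:
      bigquery_tools = None

    @unittest.skipIf(bigquery_tools is None,
                     'GCP dependencies are not installed')
    class BigQueryDependentTest(unittest.TestCase):
      def test_something(self):
        self.assertTrue(True)  # placeholder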
>>
>>
>>>
>>> I'd say we should remove testPy2Gcp task from "build" task and
>>> explicitly keep it as integration test.
>>>
>>> --Mikhail
>>>
>>>
>>> On Tue, Mar 26, 2019 at 3:12 PM Michael Luckey 
>>> wrote:
>>>


 On Tue, Mar 26, 2019 at 10:29 PM Udi Meiri  wrote:

> Luckey, I couldn't recreate your issue, but I still haven't done a
> full build.
> I created a new GCE VM with using the ubuntu-1804-bionic-v20190212a
> image (n1-standard-4 machine type).
>
> Ran the following:
> sudo apt-get update
> sudo apt-get install python-pip
> sudo apt-get install python-virtualenv
> git clone https://github.com/apache/beam.git
> cd beam
> ./gradlew :beam-sdks-python:testPy2Gcp
> [failed: no JAVA_HOME]
> sudo apt-get install openjdk-8-jdk
> ./gradlew :beam-sdks-python:testPy2Gcp
>
> Got: BUILD SUCCESSFUL in 7m 52s
>

 Nice. Thanks a lot for your help here.

 If I understand correctly, this VM is already located within GCP. Could
 it already have some setup which would need to be done on 'my' VM? For
 instance, I was contemplating whether that test tries 'to call home', but
 as I am (unfortunately ;) no Googler and do not have any GCP-specific
 setup, it fails here but misses the timeout? This is just some weird
 assumption; I did not yet look into the actual implementation.

 Which I seemingly need to do here :(


> Then I tried:
> ./gradlew build
>
> And ran out of disk space. :) (beam/ is taking 4.5G and the VM boot
> disk is 10G total)
>

 Ouch :D


>
> On Tue, Mar 26, 2019 at 1:35 PM Robert Burke 
> wrote:
>
>> Michael, your concern is reasonable, especially with the experience
>> with python, though that does help me bootstrap this work. :)
>>
>> The go tools provide caching and avoid redoing work if the source
>> files haven't changed. This applies most particularly 

Re: Implementing an S3 file system for the Python SDK

2019-04-03 Thread Pablo Estrada
Hi Pasan!
Thanks for the proposal. I'll try to take a look in the next few hours and
give some feedback.
Best
--P.

On Wed, Apr 3, 2019, 8:53 AM Ahmet Altay  wrote:

> +Pablo Estrada 
>
> On Wed, Apr 3, 2019 at 8:46 AM Lukasz Cwik  wrote:
>
>> +dev 
>>
>> On Wed, Apr 3, 2019 at 2:03 AM Pasan Kamburugamuwa <
>> pasankamburugamu...@gmail.com> wrote:
>>
>>> Hi ,
>>>
>>> I have completed a proposal to implement an S3 file system for the
>>> Python SDK for Google Summer of Code 2019. Can you please review this
>>> proposal, and if there are any issues with it, let me know.
>>> Here is the link to the project proposal -
>>>
>>> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>>>
>>> Thank you
>>> Pasan Kamburugamuwa
>>>
>>>


Re: ParDo Execution Time stat is always 0

2019-04-04 Thread Pablo Estrada
Hello guys!
Alex, Mikhail and Ryan are working on support for metrics in the
portability framework. The support on the SDK is pretty advanced AFAIK*,
and the next step is to get the metrics back into the runner. Lukasz and
I are working on a project that depends on this too, so I'm adding
everyone so we can get an idea of what's missing.

I believe:
- User metrics are fully wired up in the SDK
- State sampler (timing) metrics are wired up as well (is that right, +Alex
Amato ?)
- Work is ongoing to send the updates back to Flink.
- What is the plan for making metrics queriable from Flink? +Ryan Williams


Thanks!
-P.
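For reference, user metrics are queried from a pipeline result roughly like
this (a sketch; whether the query returns anything is exactly the runner
wiring discussed here):

    import apache_beam as beam
    from apache_beam.metrics.metric import Metrics, MetricsFilter

    class CountingDoFn(beam.DoFn):
      def process(self, element):
        Metrics.counter('example', 'elements').inc()
        yield element

    p = beam.Pipeline()
    _ = p | beam.Create([1, 2, 3]) | beam.ParDo(CountingDoFn())
    result = p.run()
    result.wait_until_finish()

    # Query the counter back from the runner after the run completes.
    query = result.metrics().query(MetricsFilter().with_name('elements'))
    for counter in query['counters']:
      print(counter.key, counter.committed)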



On Wed, Apr 3, 2019 at 12:02 PM Thomas Weise  wrote:

> I believe this is where the metrics are supplied:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/operations.py
>
> git grep process_bundle_msecs   yields results for dataflow worker only
>
> There isn't any test coverage for the Flink runner:
>
>
> https://github.com/apache/beam/blob/d38645ae8758d834c3e819b715a66dd82c78f6d4/sdks/python/apache_beam/runners/portability/flink_runner_test.py#L181
>
>
>
> On Wed, Apr 3, 2019 at 10:45 AM Akshay Balwally 
> wrote:
>
>> Should have added- I'm using Python sdk, Flink runner
>>
>> On Wed, Apr 3, 2019 at 10:32 AM Akshay Balwally 
>> wrote:
>>
>>> Hi,
>>> I'm hoping to get metrics on the amount of time spent on each operator,
>>> so it seems like the stat
>>>
>>>
>>> {organization_specific_prefix}.operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean
>>>
>>> would be pretty helpful. But in practice, this stat always shows 0,
>>> which I interpret as 0 milliseconds spent per bundle, which can't be
>>> correct (other stats show that the operators are running, and timers within
>>> the operators show more reasonable times). Is this a known bug?
>>>
>>>
>>> --
>>> *Akshay Balwally*
>>> Software Engineer
>>> 937.271.6469 <+19372716469>
>>> [image: Lyft] 
>>>
>>
>>
>> --
>> *Akshay Balwally*
>> Software Engineer
>> 937.271.6469 <+19372716469>
>> [image: Lyft] 
>>
>


Re: test_split_crazy_sdf broken in python presubmit. 'DataInputOperation' object has no attribute 'index'

2019-04-04 Thread Pablo Estrada
FWIW I have seen this being flaky in a couple different PRs, and passing on
the second PreCommit run. I don't know if it's due to the unhealthy worker
machines, or if it's actually flaky.
Best
-P.

On Thu, Apr 4, 2019 at 10:03 AM Lukasz Cwik  wrote:

> I think its a Jenkins executor issue because
> https://github.com/apache/beam/pull/8217 passed its test just now.
>
> On Thu, Apr 4, 2019 at 10:02 AM Lukasz Cwik  wrote:
>
>> I have tried running this test at head locally and have not reproduced
>> this failure, because I instead hit a different failure related to the
>> .with_complete method not being available. I'm not yet sure whether this is
>> a Jenkins executor issue or an actual code issue.
>>
>> On Thu, Apr 4, 2019 at 9:17 AM Alex Amato  wrote:
>>
>>> https://jira.apache.org/jira/browse/BEAM-7006
>>>
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Python_Phrase/331/testReport/junit/apache_beam.runners.portability.fn_api_runner_test/FnApiRunnerSplitTest/test_split_crazy_sdf_2/
>>>
>>> Traceback (most recent call last):
>>>   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 157, in _execute
>>>     response = task()
>>>   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 216, in <lambda>
>>>     lambda: self.progress_worker.do_instruction(request), request)
>>>   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 312, in do_instruction
>>>     request.instruction_id)
>>>   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 354, in process_bundle_split
>>>     process_bundle_split=processor.try_split(request))
>>>   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/runners/worker/bundle_processor.py", line 588, in try_split
>>>     desired_split.estimated_input_elements)
>>>   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/runners/worker/bundle_processor.py", line 144, in try_split
>>>     if total_buffer_size < self.index + 1:
>>> AttributeError: 'DataInputOperation' object has no attribute 'index'
>>>
>>


Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Pablo Estrada
Could this be a backwards-incompatible change that would prevent pipelines
from upgrading? If they have data in flight between operators and we
change the coder, they would break?
I know very little about coders, but since nobody has mentioned it, I
wanted to make sure we have it in mind.
-P.

On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles  wrote:

> Agree that a coder URN defines the encoding. I see that string UTF-8 was
> added to the proto enum, but it needs a written spec of the encoding.
> Ideally some test data that different languages can use to drive compliance
> testing.
>
> Kenn
>
> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke  wrote:
>
>> String UTF8 was recently added as a "standard coder" URN in the protos,
>> but I don't think that developed beyond Java, so adding it to Python would
>> be reasonable in my opinion.
>>
>> The Go SDK handles Strings as "custom coders" presently which for Go are
>> always length prefixed (and reported to the Runner as LP+CustomCoder). It
>> would be straight forward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>>
>>
>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee  wrote:
>>
>>> Hi all,
>>>
>>> It looks like the UTF-8 string coders in the Java and Python SDKs use different
>>> encoding schemes. StringUtf8Coder in Java SDK puts the varint length of the
>>> input string before actual data bytes however StrUtf8Coder in Python SDK
>>> directly encodes the input string to bytes value. For the last few weeks,
>>> I've been testing and fixing cross-language IO transforms and this
>>> discrepancy is a major blocker for me. IMO, we should unify the encoding
>>> schemes of UTF8 strings across the different SDKs and make it a standard
>>> coder. Any thoughts?
>>>
>>> Thanks,
>>>
>>
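A small sketch of the discrepancy Heejong describes, with the varint length
prefix written out by hand (illustrative; per the thread, Java prefixes the
length while Python writes the raw UTF-8 bytes):

    def encode_varint(n):
      # Base-128 varint, as used for length prefixes.
      out = bytearray()
      while True:
        bits = n & 0x7F
        n >>= 7
        out.append(bits | (0x80 if n else 0))
        if not n:
          return bytes(out)

    s = 'beam'
    python_encoding = s.encode('utf-8')  # b'beam'
    java_encoding = encode_varint(len(python_encoding)) + python_encoding

    assert python_encoding == b'beam'
    assert java_encoding == b'\x04beam'  # same bytes, plus a length prefix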


[review?] WordCount in Kotlin

2019-04-04 Thread Pablo Estrada
Hello all,
a community member has been very kind to contribute a Kotlin translation
of the WordCount pipeline [1]. The documentation, tests, and Gradle
structure for it are very good, so I am happy to merge, but since this code
will become our first Kotlin "documentation"/entry point, I wanted to be
cautious.
So if anyone wants to take a look to review the change, please do. I will
merge this in a couple days.
Thanks!
-P.

[1] https://github.com/apache/beam/pull/8034


Re: Unexpected TestStream behavior when testing stateful DoFn

2019-04-04 Thread Pablo Estrada
I saw similar issues. I'll try to debug this tomorrow. It'll take some time
to study the code, so we'll see. Assigning the issue to me.

On Fri, Mar 29, 2019 at 6:43 AM Steve Niemitz  wrote:

> This reminds me of a bug I had filed for the direct runner a few weeks
> ago, except I was running into it in Java and not python.
> https://issues.apache.org/jira/browse/BEAM-6813
>
> On Thu, Mar 28, 2019 at 9:04 PM Ahmet Altay  wrote:
>
>> I agree it looks like a bug. Filed
>> https://issues.apache.org/jira/browse/BEAM-6934
>>
>>
>>
>> On Thu, Mar 28, 2019 at 5:18 PM Reuven Lax  wrote:
>>
>>> This looks like a bug to me.
>>>
>>> On Thu, Mar 28, 2019 at 2:52 PM Amar Pai  wrote:
>>>
 Hi,

 I'm running into some unexpected behavior when trying to unit test a
 stateful DoFn that uses watermark timers as well as bag state.  I'm
 following the example here:
 https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/userstate_test.py#L333

 Expected behavior:
 When using TestStream, if a stateful DoFn adds value 'foo' to BagState
 then sets two watermark timers t1 and t2, where t1 clears the bag state and
 t2 reads from bag state and emits the contents, if t1.time < t2.time then
 nothing should get emitted when the TestPipeline is run. (bag state should
 be cleared by timer at t1 before it is read from by timer at t2)

 Actual behavior:
 For the scenario described above, results get emitted despite t1.time
 being less than t2.time.
 I've created a gist with a demonstration of the problem:
 https://gist.github.com/jcruelty/3bf5ce5865110372a2d1650b1421cde1
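The scenario boils down to something like this sketch (simplified from the
gist; timestamps are illustrative, and the DoFn requires keyed input):

    import apache_beam as beam
    from apache_beam.coders import StrUtf8Coder
    from apache_beam.transforms.timeutil import TimeDomain
    from apache_beam.transforms.userstate import (
        BagStateSpec, TimerSpec, on_timer)

    class TwoTimerDoFn(beam.DoFn):
      BAG = BagStateSpec('bag', StrUtf8Coder())
      CLEAR = TimerSpec('clear', TimeDomain.WATERMARK)  # t1
      EMIT = TimerSpec('emit', TimeDomain.WATERMARK)    # t2

      def process(self, element,
                  bag=beam.DoFn.StateParam(BAG),
                  clear=beam.DoFn.TimerParam(CLEAR),
                  emit=beam.DoFn.TimerParam(EMIT)):
        bag.add('foo')
        clear.set(10)  # t1: should clear the bag first...
        emit.set(20)   # t2: ...so this should find it empty.

      @on_timer(CLEAR)
      def on_clear(self, bag=beam.DoFn.StateParam(BAG)):
        bag.clear()

      @on_timer(EMIT)
      def on_emit(self, bag=beam.DoFn.StateParam(BAG)):
        for value in bag.read():
          yield value  # per the expected behavior, nothing comes out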


 Is this a bug?  Would appreciate your thoughts.
 cheers

 Amar

>>>


Re: Implementing an S3 file system for the Python SDK - Updated

2019-04-08 Thread Pablo Estrada
Currently, Pasan is working on a design for adding a couple of
implementations of the FileSystem interface in Python, and it's not
necessary to consider SDF here, IMHO.
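Concretely, that means subclassing the SDK's FileSystem abstraction. A
heavily abbreviated skeleton (the base class defines more abstract methods
- open, create, copy, rename, delete, etc. - omitted here; the S3 wiring is
hypothetical):

    from apache_beam.io.filesystem import FileSystem

    class S3FileSystem(FileSystem):
      """Sketch of an S3 implementation of Beam's FileSystem."""

      @classmethod
      def scheme(cls):
        # Associates this implementation with 's3://...' paths.
        return 's3'

      def join(self, basepath, *paths):
        return '/'.join([basepath.rstrip('/')] + list(paths))

      def exists(self, path):
        raise NotImplementedError('backed by an S3 client in a real impl')

      def size(self, path):
        raise NotImplementedError('backed by an S3 client in a real impl')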

On the other hand, Python's fileio[1] could probably use SDF-based
improvements to split when many files are being matched.
Best
-P.

On Mon, Apr 8, 2019 at 10:00 AM Alex Amato  wrote:

> +Lukasz Cwik , +Boyuan Zhang , +Lara
> Schmidt 
>
> Should splittable DoFn be considered in this design? In order to split and
> scale the source step properly?
>
> On Mon, Apr 8, 2019 at 9:11 AM Ahmet Altay  wrote:
>
>> +dev  +Pablo Estrada  +Chamikara
>> Jayalath  +Udi Meiri 
>>
>> Thank you Pasan. I quickly looked at the proposal and it looks good.
>> Added a few folks who could offer additional feedback.
>>
>> On Mon, Apr 8, 2019 at 12:13 AM Pasan Kamburugamuwa <
>> pasankamburugamu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have updated the project proposal according to the given feedback.
>>> Can you check my proposal again and give me your feedback on the
>>> corrections I have made?
>>>
>>> Here is the link to the updated project proposal
>>>
>>> https://docs.google.com/document/d/1i_PoIrbmhNgwKCS1TYWC28A9RsyZQFsQCJic3aCXO-8/edit?usp=sharing
>>>
>>> Thank you
>>> Pasan Kamburugamuwa
>>>
>>


Re: Is there an integration test available for filesystem checking

2019-04-08 Thread Pablo Estrada
I recommend you send these questions to the dev@ list, Pasan.

Have you looked at the *_test.py files corresponding to each one of the
file systems? Are they all mocking their access to GCS?
Best
-P.

On Sun, Apr 7, 2019 at 11:12 PM Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:

> Hello,
>
> I am currently updating the project proposal which I have already sent to
> the community for feedback. I am having a problem with it: I want to know
> whether there is any integration test available for testing the filesystem.
>
> Thanks
> pasan kamburugamuwa
>


[QUESTION] Should DoFns be able to get the watermark?

2019-04-09 Thread Pablo Estrada
I am experimenting with state / timers in Python. As I look at the
DoFnProcessParams[1], I see that it's possible for a DoFn to receive
several arguments (e.g. Timers, Side Inputs, etc). Also the Watermark via
WatermarkReporterParam.

I see that this parameter is not handled by runners when filling up the
arguments for a DoFn[2][3]. So, as far as I can tell, DoFns are not
currently able to get the watermark.

Is this a bug, or is it intentional? Perhaps there's another way to find
out the watermark for a DoFn?

Best
-P.

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L381-L390

[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L477-L488
[3]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L605-L620


Re: [QUESTION] Should DoFns be able to get the watermark?

2019-04-09 Thread Pablo Estrada
hi Luke,
thanks for the prompt reply: )

That makes sense. I think I'll go back to my cave to read a bunch about
streaming. : )

I was looking for this to try to write a sequence generator for Python in
streaming, and I was trying to debug what was going on. I was trying to
allow the DoFn to receive a watermark reported by the upstream source. (...
does that answer "which watermark?"... I am not sure that it does... but
maybe..).

Do you think it's a reasonable use case for DoFns to know what the upstream
watermark is?
I hope that makes at least a some sense... : )

If it doesn't make sense, feel free to ignore, and I'll go do my readings.
Thanks!
-P.

On Tue, Apr 9, 2019 at 1:44 PM Lukasz Cwik  wrote:

> WatermarkReporterParam is about reporting the watermark. The main use case
> is for SplittableDoFns to be able to report the data watermark.
>
> The watermark is per input and output of a DoFn. Also each bundle being
> processed has its local watermarks while the runner computes the global
> watermark. The runners watermark could be per key, or key range or global
> across all keys.
>
> There is no runner agnostic way to read the watermark today. Is there a
> usecase you are targeting that would help from having access to the
> watermark (also, which watermark?)?
>
>
> On Tue, Apr 9, 2019 at 1:28 PM Pablo Estrada  wrote:
>
>> I am experimenting with state / timers in Python. As I look at the
>> DoFnProcessParams[1], I see that it's possible for a DoFn to receive
>> several arguments (e.g. Timers, Side Inputs, etc). Also the Watermark via
>> WatermarkReporterParam.
>>
>> I see that this parameter is not handled by runners when filling up the
>> arguments for a DoFn[2][3]. So, as far as I can tell, DoFns are not
>> currently able to get the watermark.
>>
>> Is this a bug, or is it intentional? Perhaps there's another way to find
>> out the watermark for a DoFn?
>>
>> Best
>> -P.
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L381-L390
>>
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L477-L488
>> [3]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L605-L620
>>
>


Re: Updates on Beam Jenkins

2019-04-09 Thread Pablo Estrada
Thanks for the updates, Yifan. I am sure this process has been difficult,
and I appreciate the good communication, and that validating the new setup
for the nodes didn't really affect anyone's workflow.

I imagine that once we move to dockerizing the testing environment, it will
be much simpler to restart machines that are having trouble?
Thanks again!
-P.

On Tue, Apr 9, 2019 at 2:23 PM Yifan Zou  wrote:

> Hello,
>
> I have some good news about our Jenkins nodes. We now have 7 new
> nodes online, named
> "apache-beam-jenkins-[1, 2, 4, 7, 8, 9, 12]", which replace the old
> broken agents "beam[1, 2, 4, 7, 8, 9, 12]". This helps to reduce the job
> waiting queue so that your tests get executed quickly. They're
> starting up and accepting jobs. There is no action needed on your end.
>
> I'll keep the remaining old agents running for one more week, just in case,
> so that development work won't be fully blocked if any unexpected issues
> show up on the new agents. Once the new agents are stable and verified,
> I'll move forward to connect the remaining agents and remove the old set.
> The playbook is also on the way.
>
> For more background and information of the Jenkins updates, please see:
> One pager:
> https://docs.google.com/document/d/1c38IPrF94PZC-ItGZgmAgAKrgmC1MGA6N6nkK0cL6L4/edit?ts=5ca54b3e#
> Environment verification sheet:
> https://docs.google.com/spreadsheets/d/1MDL6vy_0iaFSZeWQ-4JWKlRiZ5WFdDVjJh6Xvczgld0/edit#gid=0
> Previous Thread on dev@:
> https://lists.apache.org/thread.html/7b9863b241b37484f321d8812e2ad10d8f054ec720aec4b98efe0446@%3Cdev.beam.apache.org%3E
>
> Thanks.
>
> Regards.
> Yifan Zou
>
>
>


Re: [QUESTION] Should DoFns be able to get the watermark?

2019-04-09 Thread Pablo Estrada
sounds good. Thanks guys <3

On Tue, Apr 9, 2019 at 3:19 PM Lukasz Cwik  wrote:

> UnboundedSources and SplittableDoFns report watermarks which the runner
> uses to compute how much the watermark could advance if it processed some
> outstanding work. But it is always upto the runner to choose when the
> watermark advances. The runner could process each work item in watermark
> priority order and advance the watermark in small increments or could
> process many work items and then advance the watermark a lot. (Note that
> the BoundedSources API doesn't allow for reporting the watermark and it
> starts at Beam's concept of START OF TIME and advances in one step to
> Beam's concept of END OF TIME).
>
> You might be able to write what you want with an event based timer. Kenn
> wrote (2?) blog posts on state and timers that have some pretty good
> explanations and examples.
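As a concrete reading of that suggestion, a 'looping' event-time timer can
pace output off the watermark without reading it directly. A sketch
(interval and start handling are illustrative; it assumes timer callbacks
can receive DoFn.TimestampParam):

    import apache_beam as beam
    from apache_beam.transforms.timeutil import TimeDomain
    from apache_beam.transforms.userstate import TimerSpec, on_timer
    from apache_beam.utils.timestamp import Duration

    class SequenceDoFn(beam.DoFn):
      TICK = TimerSpec('tick', TimeDomain.WATERMARK)

      def process(self, element, tick=beam.DoFn.TimerParam(TICK)):
        _, start = element  # requires keyed input
        tick.set(start)     # arm the first firing

      @on_timer(TICK)
      def on_tick(self, tick=beam.DoFn.TimerParam(TICK),
                  ts=beam.DoFn.TimestampParam):
        yield ts  # fires once the watermark passes ts
        tick.set(ts + Duration(seconds=1))  # re-arm: loop with the watermark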
>
> On Tue, Apr 9, 2019 at 2:27 PM Pablo Estrada  wrote:
>
>> hi Luke,
>> thanks for the prompt reply: )
>>
>> That makes sense. I think I'll go back to my cave to read a bunch about
>> streaming. : )
>>
>> I was looking for this to try to write a sequence generator for Python in
>> streaming, and I was trying to debug what was going on. I was trying to
>> allow the DoFn to receive a watermark reported by the upstream source. (...
>> does that answer "which watermark?"... I am not sure that it does... but
>> maybe..).
>>
>> Do you think it's a reasonable use case for DoFns to know what the
>> upstream watermark is?
>> I hope that makes at least a some sense... : )
>>
>> If it doesn't make sense, feel free to ignore, and I'll go do my readings.
>> Thanks!
>> -P.
>>
>> On Tue, Apr 9, 2019 at 1:44 PM Lukasz Cwik  wrote:
>>
>>> WatermarkReporterParam is about reporting the watermark. The main
>>> use case is for SplittableDoFns to be able to report the data watermark.
>>>
>>> The watermark is per input and output of a DoFn. Also each bundle being
>>> processed has its local watermarks while the runner computes the global
>>> watermark. The runners watermark could be per key, or key range or global
>>> across all keys.
>>>
>>> There is no runner agnostic way to read the watermark today. Is there a
>>> usecase you are targeting that would help from having access to the
>>> watermark (also, which watermark?)?
>>>
>>>
>>> On Tue, Apr 9, 2019 at 1:28 PM Pablo Estrada  wrote:
>>>
>>>> I am experimenting with state / timers in Python. As I look at the
>>>> DoFnProcessParams[1], I see that it's possible for a DoFn to receive
>>>> several arguments (e.g. Timers, Side Inputs, etc). Also the Watermark via
>>>> WatermarkReporterParam.
>>>>
>>>> I see that this parameter is not handled by runners when filling up the
>>>> arguments for a DoFn[2][3]. So, as far as I can tell, DoFns are not
>>>> currently able to get the watermark.
>>>>
>>>> Is this a bug, or is it intentional? Perhaps there's another way to
>>>> find out the watermark for a DoFn?
>>>>
>>>> Best
>>>> -P.
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L381-L390
>>>>
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L477-L488
>>>> [3]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L605-L620
>>>>
>>>


Re: [QUESTION] Should DoFns be able to get the watermark?

2019-04-09 Thread Pablo Estrada
Yup! : ) - I think

On Tue, Apr 9, 2019 at 3:52 PM Brian Hulette  wrote:

> Are these the blog posts?
>
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
> On Tue, Apr 9, 2019 at 3:41 PM Pablo Estrada  wrote:
>
>> sounds good. Thanks guys <3
>>
>> On Tue, Apr 9, 2019 at 3:19 PM Lukasz Cwik  wrote:
>>
>>> UnboundedSources and SplittableDoFns report watermarks which the runner
>>> uses to compute how much the watermark could advance if it processed some
>>> outstanding work. But it is always up to the runner to choose when the
>>> watermark advances. The runner could process each work item in watermark
>>> priority order and advance the watermark in small increments or could
>>> process many work items and then advance the watermark a lot. (Note that
>>> the BoundedSources API doesn't allow for reporting the watermark and it
>>> starts at Beam's concept of START OF TIME and advances in one step to
>>> Beam's concept of END OF TIME).
>>>
>>> You might be able to write what you want with an event based timer. Kenn
>>> wrote (2?) blog posts on state and timers that have some pretty good
>>> explanations and examples.
>>>
>>> On Tue, Apr 9, 2019 at 2:27 PM Pablo Estrada  wrote:
>>>
>>>> hi Luke,
>>>> thanks for the prompt reply: )
>>>>
>>>> That makes sense. I think I'll go back to my cave to read a bunch about
>>>> streaming. : )
>>>>
>>>> I was looking for this to try to write a sequence generator for Python
>>>> in streaming, and I was trying to debug what was going on. I was trying to
>>>> allow the DoFn to receive a watermark reported by the upstream source. (...
>>>> does that answer "which watermark?"... I am not sure that it does... but
>>>> maybe..).
>>>>
>>>> Do you think it's a reasonable use case for DoFns to know what the
>>>> upstream watermark is?
>>>> I hope that makes at least a some sense... : )
>>>>
>>>> If it doesn't make sense, feel free to ignore, and I'll go do my
>>>> readings.
>>>> Thanks!
>>>> -P.
>>>>
>>>> On Tue, Apr 9, 2019 at 1:44 PM Lukasz Cwik  wrote:
>>>>
>>>>> WatermarkReporterParam is about reporting the watermark. The main
>>>>> use case is for SplittableDoFns to be able to report the data watermark.
>>>>>
>>>>> The watermark is per input and output of a DoFn. Also each bundle
>>>>> being processed has its local watermarks while the runner computes the
>>>>> global watermark. The runners watermark could be per key, or key range or
>>>>> global across all keys.
>>>>>
>>>>> There is no runner agnostic way to read the watermark today. Is there
>>>>> a usecase you are targeting that would help from having access to the
>>>>> watermark (also, which watermark?)?
>>>>>
>>>>>
>>>>> On Tue, Apr 9, 2019 at 1:28 PM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> I am experimenting with state / timers in Python. As I look at the
>>>>>> DoFnProcessParams[1], I see that it's possible for a DoFn to receive
>>>>>> several arguments (e.g. Timers, Side Inputs, etc). Also the Watermark via
>>>>>> WatermarkReporterParam.
>>>>>>
>>>>>> I see that this parameter is not handled by runners when filling up
>>>>>> the arguments for a DoFn[2][3]. So, as far as I can tell, DoFns are not
>>>>>> currently able to get the watermark.
>>>>>>
>>>>>> Is this a bug, or is it intentional? Perhaps there's another way to
>>>>>> find out the watermark for a DoFn?
>>>>>>
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L381-L390
>>>>>>
>>>>>> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L477-L488
>>>>>> [3]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L605-L620
>>>>>>
>>>>>


Re: Spotless check on build.gradle files

2019-04-09 Thread Pablo Estrada
Thanks Lukasz.

It seems that there's a spotless task added to applyGroovyNature:
https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1346-L1355

Perhaps it's missing some more conditions?
-P.

On Tue, Apr 9, 2019 at 2:57 AM Łukasz Gajowy  wrote:

> Hi,
>
> it seems that spotless doesn't check build.gradle files if
> applyGroovyNature() is not used in them. We noticed this during code review
> - spotless succeeded despite improper indentation in a build.gradle file.
> IMO, such files should be checked by default, so I created an issue for
> this: https://issues.apache.org/jira/browse/BEAM-7040 (currently, no
> one's assigned)
>
> Łukasz
>


Re: JDK11 support?

2019-04-09 Thread Pablo Estrada
To clarify: This included adding tests to run continuously on Jenkins (by
my previous phrasing, it may have sounded like we only tested this manually
:P).

On Tue, Apr 9, 2019 at 4:57 PM Pablo Estrada  wrote:

> Our work (mostly done by Michal) consisted of testing that Beam artifacts
> built with Java 8 can run in a Java 11 environment in the DirectRunner and
> Dataflow. We have not planned other work in this area.
>
> On Tue, Apr 9, 2019 at 3:00 PM Ahmet Altay  wrote:
>
>> Most recently +Pablo Estrada  and +Michał Walenia
>>  were working on Java 11 related JIRAs. They
>> may have more context.
>>
>> On Tue, Apr 9, 2019 at 1:22 PM Maximilian Michels  wrote:
>>
>>> Hi Yi,
>>>
>>> That's a great question. Beam is still on Java 8 at the moment. There is
>>> a JIRA issue for making Beam compatible with Java 11:
>>> https://issues.apache.org/jira/browse/BEAM-2530
>>>
>>> As you can read in the issue, the upcoming Beam 2.12.0 has experimental
>>> support for Java 11. That said, there is still code that needs a major
>>> overhaul, e.g. automatic staging of pipeline jars
>>> (https://issues.apache.org/jira/browse/BEAM-5495).
>>>
>>> I hope we can achieve proper Java 11 compatibility in the next releases.
>>> Perhaps somebody else can give further insight into how much work is
>>> left.
>>>
>>> Thanks,
>>> Max
>>>
>>> On 09.04.19 22:07, Yi Pan wrote:
>>> > Hi, everyone,
>>> >
>>> > I checked out the latest Beam code and it seems that the document still
>>> > says "JDK8". Is there any plan to officially support JDK11?
>>> >
>>> > Thanks a lot!
>>> >
>>> > -Yi
>>>
>>


Re: JDK11 support?

2019-04-09 Thread Pablo Estrada
Our work (mostly done by Michal) consisted of testing that Beam artifacts
built with Java 8 can run in a Java 11 environment in the DirectRunner and
Dataflow. We have not planned other work in this area.

On Tue, Apr 9, 2019 at 3:00 PM Ahmet Altay  wrote:

> Most recently +Pablo Estrada  and +Michał Walenia
>  were working on Java 11 related JIRAs. They
> may have more context.
>
> On Tue, Apr 9, 2019 at 1:22 PM Maximilian Michels  wrote:
>
>> Hi Yi,
>>
>> That's a great question. Beam is still on Java 8 at the moment. There is
>> a JIRA issue for making Beam compatible with Java 11:
>> https://issues.apache.org/jira/browse/BEAM-2530
>>
>> As you can read in the issue, the upcoming Beam 2.12.0 has experimental
>> support for Java 11. That said, there is still code that needs a major
>> overhaul, e.g. automatic staging of pipeline jars
>> (https://issues.apache.org/jira/browse/BEAM-5495).
>>
>> I hope we can achieve proper Java 11 compatibility in the next releases.
>> Perhaps somebody else can give further insight into how much work is left.
>>
>> Thanks,
>> Max
>>
>> On 09.04.19 22:07, Yi Pan wrote:
>> > Hi, everyone,
>> >
>> > I checked out the latest Beam code and it seems that the document still
>> > says "JDK8". Is there any plan to officially support JDK11?
>> >
>> > Thanks a lot!
>> >
>> > -Yi
>>
>


Re: Unexpected TestStream behavior when testing stateful DoFn

2019-04-09 Thread Pablo Estrada
I've added your snippet as a unit test, and fixed that test[1] : ) - thanks
for reporting the issue. Please let me know if you see any other issues
like this.
Best
-P.

[1] https://github.com/apache/beam/pull/8252

On Thu, Apr 4, 2019 at 2:46 PM Pablo Estrada  wrote:

> I saw similar issues. I'll try to debug this tomorrow. It'll take some
> time to study the code, so we'll see. Assigning the issue to me.
>
> On Fri, Mar 29, 2019 at 6:43 AM Steve Niemitz  wrote:
>
>> This reminds me of a bug I had filed for the direct runner a few weeks
>> ago, except I was running into it in Java and not python.
>> https://issues.apache.org/jira/browse/BEAM-6813
>>
>> On Thu, Mar 28, 2019 at 9:04 PM Ahmet Altay  wrote:
>>
>>> I agree it looks like a bug. Filed
>>> https://issues.apache.org/jira/browse/BEAM-6934
>>>
>>>
>>>
>>> On Thu, Mar 28, 2019 at 5:18 PM Reuven Lax  wrote:
>>>
>>>> This looks like a bug to me.
>>>>
>>>> On Thu, Mar 28, 2019 at 2:52 PM Amar Pai  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running into some unexpected behavior when trying to unit test a
>>>>> stateful DoFn that uses watermark timers as well as bag state.  I'm
>>>>> following the example here:
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/userstate_test.py#L333
>>>>>
>>>>> Expected behavior:
>>>>> When using TestStream, if a stateful DoFn adds value 'foo' to BagState
>>>>> then sets two watermark timers t1 and t2, where t1 clears the bag state 
>>>>> and
>>>>> t2 reads from bag state and emits the contents, if t1.time < t2.time then
>>>>> nothing should get emitted when the TestPipeline is run. (bag state should
>>>>> be cleared by timer at t1 before it is read from by timer at t2)
>>>>>
>>>>> Actual behavior:
>>>>> For the scenario described above, results get emitted despite t1.time
>>>>> being less than t2.time.
>>>>> I've created a gist with a demonstration of the problem:
>>>>> https://gist.github.com/jcruelty/3bf5ce5865110372a2d1650b1421cde1
>>>>>
>>>>>
>>>>> Is this a bug?  Would appreciate your thoughts.
>>>>> cheers
>>>>>
>>>>> Amar
>>>>>
>>>>


Re: [ANNOUNCE] New committer announcement: Boyuan Zhang

2019-04-10 Thread Pablo Estrada
Well deserved : ) congrats Boyuan!

On Wed, Apr 10, 2019 at 10:08 AM Aizhamal Nurmamat kyzy 
wrote:

> Congratulations Boyuan!
>
> On Wed, Apr 10, 2019 at 9:52 AM Ruoyun Huang  wrote:
>
>> Thanks for your contributions and congratulations Boyuan!
>>
>> On Wed, Apr 10, 2019 at 9:00 AM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>>> Boyuan Zhang.
>>>
>>> Boyuan has been contributing to Beam since early 2018. She has proposed
>>> 100+ pull requests across a wide range of topics: bug fixes, integration
>>> tests, build improvements, metrics features, and release automation. Two big
>>> picture things to highlight are building/releasing Beam Python wheels and
>>> managing the donation of the Beam Dataflow Java Worker, including help with
>>> I.P. clearance.
>>>
>>> In consideration of Boyuan's contributions, the Beam PMC trusts Boyuan
>>> with the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Boyuan, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-
>>> beam-committer
>>>
>>
>>
>> --
>> 
>> Ruoyun  Huang
>>
>>


Re: ParDo Execution Time stat is always 0

2019-04-10 Thread Pablo Estrada
This sounds like a bug then? +Alex Amato 

On Wed, Apr 10, 2019 at 3:59 AM Maximilian Michels  wrote:

> Hi @all,
>
>  From a quick debugging session, I conclude that the wiring is in place
> for the Flink Runner. There is a ProgressReporter that reports
> MonitoringInfos to Flink, in a similar fashion as the "legacy" Runner.
>
> The bundle duration metrics are 0, but the element count gets reported
> correctly. It appears to be an issue of the Python/Java harness because
> "ProcessBundleProgressResponse" contains only 0 values for the bundle
> duration.
>
> Thanks,
> Max
>
> On 04.04.19 19:54, Mikhail Gryzykhin wrote:
> > Hi everyone,
> >
> > Quick summary on python and Dataflow Runner:
> > Python SDK already reports:
> > - MSec
> > - User metrics (int64 and distribution)
> > - PCollection Element Count
> > - Work on MeanByteCount for pcollection is ongoing here
> > <https://github.com/apache/beam/pull/8062>.
> >
> > Dataflow Runner:
> > - all metrics listed above are passed through to Dataflow.
> >
> > Ryan can give more information on Flink Runner. I also see Maximilian on
> > some of relevant PRs, so he might comment on this as well.
> >
> > Regards,
> > Mikhail.
> >
> >
> > On Thu, Apr 4, 2019 at 10:43 AM Pablo Estrada  > <mailto:pabl...@google.com>> wrote:
> >
> > Hello guys!
> > Alex, Mikhail and Ryan are working on support for metrics in the
> > portability framework. The support on the SDK is pretty advanced
> > AFAIK*, and the next step is to get the metrics back into the
> > runner. Lukasz and I are working on a project that depends on
> > this too, so I'm adding everyone so we can get an idea of what's
> > missing.
> >
> > I believe:
> > - User metrics are fully wired up in the SDK
> > - State sampler (timing) metrics are wired up as well (is that
> > right, +Alex Amato <mailto:ajam...@google.com>?)
> > - Work is ongoing to send the updates back to Flink.
> > - What is the plan for making metrics queriable from Flink? +Ryan
> > Williams <mailto:r...@runsascoded.com>
> >
> > Thanks!
> > -P.
> >
> >
> >
> > On Wed, Apr 3, 2019 at 12:02 PM Thomas Weise  > <mailto:t...@apache.org>> wrote:
> >
> > I believe this is where the metrics are supplied:
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/operations.py
> >
> > git grep process_bundle_msecs   yields results for dataflow
> > worker only
> >
> > There isn't any test coverage for the Flink runner:
> >
> >
> https://github.com/apache/beam/blob/d38645ae8758d834c3e819b715a66dd82c78f6d4/sdks/python/apache_beam/runners/portability/flink_runner_test.py#L181
> >
> >
> >
> > On Wed, Apr 3, 2019 at 10:45 AM Akshay Balwally
> > mailto:abalwa...@lyft.com>> wrote:
> >
> > Should have added- I'm using Python sdk, Flink runner
> >
> > On Wed, Apr 3, 2019 at 10:32 AM Akshay Balwally
> > mailto:abalwa...@lyft.com>> wrote:
> >
> > Hi,
> > I'm hoping to get metrics on the amount of time spent on
> > each operator, so it seems like the stat
> >
> >
>  
> {organization_specific_prefix}.operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean
> >
> > would be pretty helpful. But in practice, this stat
> > always shows 0, which I interpret as 0 milliseconds
> > spent per bundle, which can't be correct (other stats
> > show that the operators are running, and timers within
> > the operators show more reasonable times). Is this a
> > known bug?
> >
> >
> > --
> > *Akshay Balwally*
> > Software Engineer
> > 937.271.6469 
> > Lyft <http://www.lyft.com/>
> >
> >
> >
> > --
> > *Akshay Balwally*
> > Software Engineer
> > 937.271.6469 
> > Lyft <http://www.lyft.com/>
> >
>


Re: Projects Can Apply Individually for Google Season of Docs

2019-04-10 Thread Pablo Estrada
I'd be happy to be a mentor for this to help add getting started
documentation for Python on Flink. I'd want to focus on the reviews and
less on the administration - so I'm willing to be a secondary administrator
if that's necessary to move forward, but I'd love it if someone would help
administer.
FWIW, neither the administrator nor any other mentor has to be a committer.

Anyone willing to be primary administrator and also a mentor?

Thanks
-P.

On Fri, Apr 5, 2019 at 9:40 AM Kenneth Knowles  wrote:

> Yes, this is great. Thanks for noticing the call and pushing ahead on
> this, Aizhamal!
>
> I would also like to see the runner comparison revamp at
> https://issues.apache.org/jira/browse/BEAM-2888 which would help users
> really understand what they can and cannot do in plain terms.
>
> Kenn
>
> On Fri, Apr 5, 2019 at 9:30 AM Ahmet Altay  wrote:
>
>> Thank you Aizhamal for volunteering. I am happy to help as an
>> administrator.
>>
>> cc: +Rose Nguyen  +Melissa Pashniak
>>  in case they will be interested in mentorship
>> and/or administration.
>>
>>
>>
>>
>> On Fri, Apr 5, 2019 at 9:16 AM Thomas Weise  wrote:
>>
>>> This is great. Beam documentation needs work in several areas, Python
>>> SDK, portability and SQL come to mind right away :)
>>>
>>>
>>> On Thu, Apr 4, 2019 at 4:21 PM Aizhamal Nurmamat kyzy <
>>> aizha...@google.com> wrote:
>>>
 Hello everyone,

 As the ASF announced that each project can apply for Season of Docs
 individually, I would like to volunteer to be one of the administrators for
 the program. Is this fine with everyone in the community? If so, I will
 start working on the application on behalf of Beam this week, and I will
 send updates on this thread with progress.

 The program requires two administrators, so any volunteers would be
 appreciated. I’m happy to take on the administrative load, and partner with
 committers or PMC members. We will also need at least two mentors for the
 program, to onboard tech writers to the project and work with them closely
 during 3 months period. Please express your interest in the thread :)

 If you have some ideas to work on for Season of Docs, please let me
 know directly, or file a JIRA issue, and add the "gsod" and "gsod2019"
 labels to it. It will help us to gather ideas and put them together in the
 application.

 Thanks everybody,
 Aizhamal


 On Wed, Apr 3, 2019 at 1:55 PM  wrote:

> Hi All
>
> Initially the ASF as an organisation was planning to apply as a
> mentoring organisation for Google Season of Docs on behalf of all Apache
> projects, but if accepted, the maximum number of technical writers we
> could be allocated is two. Two technical writers would probably not be enough to
> cover the potential demand from all our projects interested in
> participating.
>
> We've received feedback from Google that individual projects can
> apply. I will withdraw the ASF application so that any Apache project
> interested can apply individually for Season of Docs and so have the
> potential of being allocated a technical writer.
>
> Applications for Season of Docs is open now and closes on 23rd April
> 2019. If your project would like to apply then please see the following
> link:
>
> https://developers.google.com/season-of-docs/docs/get-started/
>
> Good luck everyone!
>
> Thanks
> Sharan
>
>
>


Re: [docs] Python State & Timers

2019-04-11 Thread Pablo Estrada
I've been going over this topic a bit. I'll add the snippets next week, if
that's fine by y'all.
Best
-P.

On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw  wrote:

> That's a great idea! It would probably be pretty easy to add the
> corresponding code snippets to the docs as well.
>
> On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels  wrote:
> >
> > Hi everyone,
> >
> > The Python SDK still lacks documentation on state and timers.
> >
> > As a first step, what do you think about updating these two blog posts
> > with the corresponding Python code?
> >
> > https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> > https://beam.apache.org/blog/2017/08/28/timely-processing.html
> >
> > Thanks,
> > Max
>


Re: Accessing CSV headers

2019-04-11 Thread Pablo Estrada
What SDK are you using?

There are FileIO transforms that should help you do this. We can help you
with a snippet if you share the SDK you're using.
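For instance, in the Python SDK one minimal approach pairs the filesystems
API with the standard csv module; a sketch (the input path is hypothetical):

    import csv
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def rows_with_header(path):
      # Read one CSV file and pair each record with the header row.
      with FileSystems.open(path) as f:
        lines = f.read().decode('utf-8').splitlines()
      reader = csv.reader(lines)
      header = next(reader)
      for row in reader:
        yield dict(zip(header, row))

    with beam.Pipeline() as p:
      _ = (p
           | beam.Create(['gs://my-bucket/input.csv'])  # hypothetical path
           | beam.FlatMap(rows_with_header)
           | beam.Map(print))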

On Thu, Apr 11, 2019 at 2:16 PM tejanahme...@gmail.com <
tejanahme...@gmail.com> wrote:

> Hi
>
> I am new to Apache Beam, and have been trying to access the CSV headers
> from a CSV file I referenced as input.
>
> Is there a way to collect the headers in a CSV using Apache Beam, so that
> the following records can be processed and compared as columns and rows to
> another CSV file?
>
> Please advise.
>
> Thanks in advance.
>


Re: [review?] WordCount in Kotlin

2019-04-12 Thread Pablo Estrada
I've merged this here: https://github.com/apache/beam/pull/8291

Thanks for all who took a look, and to Harshit for the contribution. : )

On Thu, Apr 4, 2019 at 10:30 PM Jean-Baptiste Onofré 
wrote:

> Thanks for the update Pablo.
>
> I will try to take a look during the week end.
>
> Regards
> JB
>
> On 04/04/2019 23:16, Pablo Estrada wrote:
> > Hello all,
> > a community member has been very kind to contribute a Kotlin
> > translation of the WordCount pipeline[1]. The documentation, tests, and
> > gradle structure for it is very good, so I am happy to merge, but since
> > this code will become our first Kotlin "documentation"/entrypoint, I
> > wanted to be cautious.
> > So if anyone wants to take a look to review the change, please do. I
> > will merge this in a couple days.
> > Thanks!
> > -P.
> >
> > [1] https://github.com/apache/beam/pull/8034
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [review?] WordCount in Kotlin

2019-04-12 Thread Pablo Estrada
I've merged via a squashed commit that references Jira and the PR. That
should be reasonable?
Best
-P.

On Fri, Apr 12, 2019, 12:22 PM Ankur Goenka  wrote:

> Thanks Pablo and Harshit.
>
> Just a quick reminder, please squash the "fixup" sort of commits in the PR
> based on the prior discussion on the mailing list
> https://lists.apache.org/thread.html/6d922820d6fc352479f88e5c8737f2c8893ddb706a1e578b50d28948@%3Cdev.beam.apache.org%3E
>
> On Fri, Apr 12, 2019 at 11:58 AM Pablo Estrada  wrote:
>
>> I've merged this here: https://github.com/apache/beam/pull/8291
>>
>> Thanks for all who took a look, and to Harshit for the contribution. : )
>>
>> On Thu, Apr 4, 2019 at 10:30 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Thanks for the update Pablo.
>>>
>>> I will try to take a look during the week end.
>>>
>>> Regards
>>> JB
>>>
>>> On 04/04/2019 23:16, Pablo Estrada wrote:
>>> > Hello all,
>>> > a community member has been very kind to contribute a Kotlin
>>> > translation of the WordCount pipeline[1]. The documentation, tests, and
>>> > gradle structure for it is very good, so I am happy to merge, but since
>>> > this code will become our first Kotlin "documentation"/entrypoint, I
>>> > wanted to be cautious.
>>> > So if anyone wants to take a look to review the change, please do. I
>>> > will merge this in a couple days.
>>> > Thanks!
>>> > -P.
>>> >
>>> > [1] https://github.com/apache/beam/pull/8034
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: Go SDK status

2019-04-15 Thread Pablo Estrada
+Robert Burke  ; ) thoughts?

- AFAIK, we have wordcount running on Flink

On Sat, Apr 13, 2019 at 11:31 AM Thomas Weise  wrote:

> How "experimental" is the Go SDK? What are the major work items to reach
> MVP? How close are we to be able to run let's say wordcount on the portable
> Flink runner?
>
> How current is the roadmap [1]? JIRA [2] could suggest that there is a lot
> of work left to do?
>
> Thanks,
> Thomas
>
> [1] https://beam.apache.org/roadmap/go-sdk/
> [2]
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>
>


Re: [DOC] Portable Spark Runner

2019-04-15 Thread Pablo Estrada
This is very cool Kyle. Thanks for moving it forward!
Best
-P.

On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:

> Thanks for the doc.
>
> On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver  wrote:
>
>> Hi everyone,
>>
>> As some of you know, I've been piggybacking on the existing Spark and
>> Flink runners to create a portable version of the Spark runner. I wrote up
>> a summary of the work I've done so far and what remains to be done. I'll
>> keep updating this going forward to provide a reasonably up-to-date
>> description of the state of the project. Please comment on the doc if you
>> have any thoughts.
>>
>> Link:
>>
>> https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing
>>
>> Thanks,
>> Kyle
>>
>> Kyle Weaver |  Software Engineer | github.com/ibzib |
>> kcwea...@google.com |  +1650203
>>
>


Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-15 Thread Pablo Estrada
Hello Henrique,

I am not aware of existing Beam transforms specifically used for reading in
XLSX data. Can you share what you mean by "examples related with Cs
extension"?

I am aware of some Python libraries for this sort of thing[1]. You could
use the FileIO transforms in the Python SDK to find each file, and then
write a DoFn that is able to read in data from these files. Check out this
unit test using FileIO to read CSV files[2].
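
As a very rough sketch of that idea (untested; it assumes openpyxl is
available on the workers, that each workbook fits in memory, and that the
input path is a placeholder):

import io

import apache_beam as beam
from apache_beam.io import fileio
from openpyxl import load_workbook  # third-party dependency, not part of Beam

class ReadXlsxRows(beam.DoFn):
  def process(self, readable_file):
    # readable_file is a fileio.ReadableFile; load the whole workbook.
    workbook = load_workbook(io.BytesIO(readable_file.read()), read_only=True)
    for sheet in workbook.worksheets:
      for row in sheet.iter_rows():
        yield (sheet.title, tuple(cell.value for cell in row))

with beam.Pipeline() as p:
  rows = (p
          | fileio.MatchFiles('/path/to/inputs/*.xlsx')
          | fileio.ReadMatches()
          | beam.ParDo(ReadXlsxRows()))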

Let me know if that helps, or if I went in the wrong direction from what you
needed.
Best
-P.

[1] https://openpyxl.readthedocs.io/en/stable/
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148

On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina 
wrote:

> Hello
>
> I would like to use best practices from Apache Beam to read Xlsx; however,
> I found examples only related with Cs extension.
> Is there a sample using ParDo to collect all columns and sheets from
> Excel xlsx?
> Afterwards I will put the data into Google BigQuery.
>
> Thanks & Regards
>
>


Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-16 Thread Pablo Estrada
Hm, I am not very familiar with POI, but if its readers are able to take
in a file handle, you should be able to use FileIO.match()[0] to find
your files (local, or in GCS/S3/HDFS); and FileIO.readMatches()[1] to get
file descriptors for these files.

If the POI libraries require the files to be local on your machine, you may
need to use FileSystems.copy[2] to move your files locally, and then
analyze them.

Let me know if those are some useful building blocks for your pipeline,
Best
-P.

[0]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--
[2]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-


On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina 
wrote:

> Hi Pablo,
> Thanks for your attention.
> I am so sorry, I wrote "Cs extension" badly; I meant the .csv extension!
> An example like this: load-csv-file-from-google-cloud-storage
> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>
> I was thinking of using Apache POI to read each row from the sheet and
> emit the rows to the next ParDo,
> something like this:
> .apply("xlsxToMap", ParDo.of(new DoFn() {.
>
> I don't know if it is more elegant...
>
> If you have some ideas, let me know. They will be welcome!
>
>
> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada  wrote:
>
>> Hello Henrique,
>>
>> I am not aware of existing Beam transforms specifically used for reading
>> in XLSX data. Can you share what you mean by "examples related with Cs
>> extension"?
>>
>> I am aware of some Python libraries for this sort of thing[1]. You could
>> use the FileIO transforms in the Python SDK to find each file, and then
>> write a DoFn that is able to read in data from these files. Check out this
>> unit test using FileIO to read CSV files[2].
>>
>> Let me know if that helps, or if I went in the wrong direction from what
>> you needed.
>> Best
>> -P.
>>
>> [1] https://openpyxl.readthedocs.io/en/stable/
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>
>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
>> henrique.mol...@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I would like to use best practices from Apache Beam to read Xlsx;
>>> however, I found examples only related with Cs extension.
>>> Is there a sample using ParDo to collect all columns and sheets
>>> from Excel xlsx?
>>> Afterwards I will put the data into Google BigQuery.
>>>
>>> Thanks & Regards
>>>
>>>
>>


Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
Hello all,
I've been slowly learning a bit about life in streaming, with state,
timers, triggers, etc.

The other day, I tried out a trigger pipeline that did not have the
behavior that I was expecting, and I am looking for feedback on whether I'm
missing something, or this is a bug.

Please take a look at this unit test:

https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451

Is the check correct that we would expect range [1, 6) to appear twice?
i.e. concat([1, 6), [1, 10]) ?

I have not tested this in other runners.
Thanks
-P.


Re: New IOIT Dashboards

2019-04-19 Thread Pablo Estrada
Woah this is excellent. It's very nice to see that the metrics are more
consistent now, and the signal of the tests will be much more useful. Love
it!
Best
-P.

On Fri, Apr 19, 2019 at 1:59 PM Łukasz Gajowy 
wrote:

> @Kenn Yes, seconds. I added suffixes to the widgets' legends to clarify.
> @Ankur It's still Dataflow-only, but we're closer to using Flink for that too
> (we already have the cluster setup in our codebase). Meanwhile, I fixed the
> dashboard title to clarify this too.
>
> Thanks,
> Łukasz
>
> pt., 19 kwi 2019 o 21:35 Ankur Goenka  napisał(a):
>
>> This looks great!
>> Which runner are we using for the pipeline?
>>
>> On Fri, Apr 19, 2019 at 12:03 PM Kenneth Knowles  wrote:
>>
>>> Very cool! I assume times are all in seconds?
>>>
>>> On Fri, Apr 19, 2019 at 6:26 AM Łukasz Gajowy 
>>> wrote:
>>>
 Hi,

 just wanted to announce that we improved the way we collect metrics
 from IOIT. Now we use the Metrics API for this, which allowed us to get more
 insights and collect run/read/write time (and possibly other metrics in the
 future) separately.

 The new dashboards are available here:
 https://s.apache.org/io-test-dashboards
 I also updated the docs in this PR:
 https://github.com/apache/beam/pull/8356

 Thanks,
 Łukasz





Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
Ah, sorry for the lack of clarification. Each element appears only once in
the final output. The failure is:

==
> FAIL: test_multiple_accumulating_firings
> (apache_beam.transforms.trigger_test.TriggerPipelineTest)
> --
> Traceback (most recent call last):
>   File "apache_beam/transforms/trigger_test.py", line 491, in
> test_multiple_accumulating_firings
> TriggerPipelineTest.all_records)
> AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... != ['1',
> '2', '3', '4', '5', '6',...
>
[...other output...]

(expected is:)

> - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8', '9',
> '10']
> ?   -
>
(actual is:)

> + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
> --




On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:

> What is the behavior you are seeing?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada  wrote:
>>
>>> Hello all,
>>> I've been slowly learning a bit about life in streaming, with state,
>>> timers, triggers, etc.
>>>
>>> The other day, I tried out a trigger pipeline that did not have the
>>> behavior that I was expecting, and I am looking for feedback on whether I'm
>>> missing something, or this is a bug.
>>>
>>> Please take a look at this unit test:
>>>
>>>
>>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>>
>>> Is the check correct that we would expect range [1, 6) to appear twice?
>>> i.e. concat([1, 6), [1, 10]) ?
>>>
>>
>> This is what I would expect. Your test code looks good to me. Could you
>> file an issue?
>>
>>
>>>
>>> I have not tested this in other runners.
>>> Thanks
>>> -P.
>>>
>>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
created https://jira.apache.org/jira/browse/BEAM-7122
Best
-P.

On Fri, Apr 19, 2019 at 3:50 PM Pablo Estrada  wrote:

> Ah, sorry for the lack of clarification. Each element appears only once in
> the final output. The failure is:
>
> ==
>> FAIL: test_multiple_accumulating_firings
>> (apache_beam.transforms.trigger_test.TriggerPipelineTest)
>> --
>> Traceback (most recent call last):
>>   File "apache_beam/transforms/trigger_test.py", line 491, in
>> test_multiple_accumulating_firings
>> TriggerPipelineTest.all_records)
>> AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... != ['1',
>> '2', '3', '4', '5', '6',...
>>
> [...other output...]
>
> (expected is:)
>
>> - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8', '9',
>> '10']
>> ?   -----
>>
> (actual is:)
>
>> + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
>> --
>
>
>
>
> On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:
>
>> What is the behavior you are seeing?
>>
>> Kenn
>>
>> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Hello all,
>>>> I've been slowly learning a bit about life in streaming, with state,
>>>> timers, triggers, etc.
>>>>
>>>> The other day, I tried out a trigger pipeline that did not have the
>>>> behavior that I was expecting, and I am looking for feedback on whether I'm
>>>> missing something, or this is a bug.
>>>>
>>>> Please take a look at this unit test:
>>>>
>>>>
>>>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>>>
>>>> Is the check correct that we would expect range [1, 6) to appear twice?
>>>> i.e. concat([1, 6), [1, 10]) ?
>>>>
>>>
>>> This is what I would expect. Your test code looks good to me. Could you
>>> file an issue?
>>>
>>>
>>>>
>>>> I have not tested this in other runners.
>>>> Thanks
>>>> -P.
>>>>
>>>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
Aah that makes more sense... I'll try that out. Thanks!
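
For the record, a rough sketch of the corrected shape of the pipeline
(untested; whether and when the early pane actually fires is
runner-dependent):

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
  _ = (p
       | beam.Create([('key', str(i)) for i in range(1, 11)])
       | beam.WindowInto(
           window.GlobalWindows(),
           trigger=trigger.Repeatedly(trigger.AfterCount(5)),
           accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
       | beam.GroupByKey()  # the trigger only takes effect at an aggregation
       | beam.Map(print))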

On Fri, Apr 19, 2019 at 4:12 PM Kenneth Knowles  wrote:

> Oh, wait I didn't even read the pipeline well. You don't have a GBK so
> triggers don't do anything. They only apply to aggregations. Since it is
> just a ParDo the elements flow right through and your results are expected.
> If you did have a GBK then you would have this:
>
> Expected: [ ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5', '6', '7',
> '8', '9', '10'] ]
> Actual: [ ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'] ]
>
> Where both outer lists are PCollections, hence could be reordered, and
> both inner lists are also in an undefined ordered. They have a pane index
> that says their logical order but they can be reordered. It is unusual and
> runner-dependent but best to check PCollection contents without
> order-dependence.
>
> Do you have PAssert for these sorts of checks?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 4:02 PM Pablo Estrada  wrote:
>
>> created https://jira.apache.org/jira/browse/BEAM-7122
>> Best
>> -P.
>>
>> On Fri, Apr 19, 2019 at 3:50 PM Pablo Estrada  wrote:
>>
>>> Ah, sorry for the lack of clarification. Each element appears only once
>>> in the final output. The failure is:
>>>
>>> ==
>>>> FAIL: test_multiple_accumulating_firings
>>>> (apache_beam.transforms.trigger_test.TriggerPipelineTest)
>>>> --
>>>> Traceback (most recent call last):
>>>>   File "apache_beam/transforms/trigger_test.py", line 491, in
>>>> test_multiple_accumulating_firings
>>>> TriggerPipelineTest.all_records)
>>>> AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... !=
>>>> ['1', '2', '3', '4', '5', '6',...
>>>>
>>> [...other output...]
>>>
>>> (expected is:)
>>>
>>>> - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8',
>>>> '9', '10']
>>>> ?   -
>>>>
>>> (actual is:)
>>>
>>>> + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
>>>> --
>>>
>>>
>>>
>>>
>>> On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:
>>>
>>>> What is the behavior you are seeing?
>>>>
>>>> Kenn
>>>>
>>>> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> I've been slowly learning a bit about life in streaming, with state,
>>>>>> timers, triggers, etc.
>>>>>>
>>>>>> The other day, I tried out a trigger pipeline that did not have the
>>>>>> behavior that I was expecting, and I am looking for feedback on whether 
>>>>>> I'm
>>>>>> missing something, or this is a bug.
>>>>>>
>>>>>> Please take a look at this unit test:
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>>>>>
>>>>>> Is the check correct that we would expect range [1, 6) to appear
>>>>>> twice? i.e. concat([1, 6), [1, 10]) ?
>>>>>>
>>>>>
>>>>> This is what I would expect. Your test code looks good to me. Could
>>>>> you file an issue?
>>>>>
>>>>>
>>>>>>
>>>>>> I have not tested this in other runners.
>>>>>> Thanks
>>>>>> -P.
>>>>>>
>>>>>


Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-20 Thread Pablo Estrada
+1
Ran SQL postcommit, and Dataflow Portability Java validatesrunner tests.

-P.

On Wed, Apr 17, 2019 at 1:38 AM Jean-Baptiste Onofré 
wrote:

> +1 (binding)
>
> Quickly checked with beam-samples.
>
> Regards
> JB
>
> On 16/04/2019 00:50, Andrew Pilloud wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #4 for the version
> > 2.12.0, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> >  [2], which is signed with the key with
> > fingerprint 9E7CEC0661EFD610B632C610AE8FE17F9F8AE3D4 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.12.0-RC4" [5],
> > * website pull request listing the release [6], publishing the API
> > reference manual [7], and the blog post [8].
> > * Java artifacts were built with Gradle/5.2.1 and OpenJDK/Oracle JDK
> > 1.8.0_181.
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org  [2].
> > * Validation sheet with a tab for 2.12.0 release to help with validation
> > [9].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Andrew
> >
> > [1]
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344944
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.12.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1068/
> > [5] https://github.com/apache/beam/tree/v2.12.0-RC4
> > [6] https://github.com/apache/beam/pull/8215
> > [7] https://github.com/apache/beam-site/pull/588
> > [8] https://github.com/apache/beam/pull/8314
> > [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1007316984
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Projects Can Apply Individually for Google Season of Docs

2019-04-22 Thread Pablo Estrada
Hello all,
thanks to everyone for your participation. I have submitted the application
on behalf of Beam, and requested one technical writer. Let's see how it
goes : )
Best
-P.

On Wed, Apr 17, 2019 at 10:09 PM Ahmet Altay  wrote:

> Thanks Aizhamal, I completed the forms.
>
> On Wed, Apr 17, 2019 at 6:46 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
>
>> Hi everyone,
>>
>> Here are a few updates on our application for Season of Docs:
>>
>> 1. Pablo and I have created the following document [1] with some of the
>> project ideas shared in the mailing list. If you have more ideas, please
>> add them to the doc and provide a description. If you also want to be a
>> mentor for the proposed ideas, please add your name to the table.
>>
>> 2. To submit our application, we need to publish our project ideas list.
>> For this we have opened Jira tickets with the “gsod2019” tag[2]. We should
>> maybe also think of adding a small blog post on the Beam website that
>> contains all the ideas in one place[3]? Please let me know what you think
>> about this.
>>
>> 3. By next week Tuesday (Application Deadline)
>>
>> - +pabl...@apache.org, please complete the org application form [4]
>>
>> - @Ahmet Altay, please complete the alternative administrator form [5]
>>
>> - @pabl...@apache.org, @Ahmet Altay, and all other contributors that want
>> to participate as mentors, please complete the mentor registration form [6]
>>
>>
>> Thank you,
>>
>> Aizhamal
>>
>>
>> [1]
>> https://docs.google.com/document/d/1FNf-BjB4Q7PDdqygPboLr7CyIeo6JAkrt0RBgs2I4dE/edit#
>>
>> [2]
>> https://issues.apache.org/jira/browse/BEAM-7104?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20gsod2019
>>
>> [3] https://beam.apache.org/blog/
>>
>> [4]
>> https://docs.google.com/forms/d/e/1FAIpQLScrEq5yKmadgn7LEPC8nN811-6DNmYvus5uXv_JY5BX7CH-Bg/viewform
>>
>> [5]
>> https://docs.google.com/forms/d/e/1FAIpQLSc5ZsBzqfsib-epktZp8bYxL_hO4RhT_Zz8AY6zXDHB79ue9g/viewform
>> [6]
>> https://docs.google.com/forms/d/e/1FAIpQLSe-JjGvaKKGWZOXxrorONhB8qN3mjPrB9ZVkcsntR73Cv_K7g/viewform
>>
>> On Wed, Apr 10, 2019 at 2:57 PM Pablo Estrada  wrote:
>>
>>> I'd be happy to be a mentor for this to help add getting started
>>> documentation for Python on Flink. I'd want to focus on the reviews and
>>> less on the administration - so I'm willing to be a secondary administrator
>>> if that's necessary to move forward, but I'd love it if someone would help
>>> administer.
>>> FWIW, neither the administrator nor any other mentor has to be a
>>> committer.
>>>
>>> Anyone willing to be primary administrator and also a mentor?
>>>
>>> Thanks
>>> -P.
>>>
>>> On Fri, Apr 5, 2019 at 9:40 AM Kenneth Knowles  wrote:
>>>
>>>> Yes, this is great. Thanks for noticing the call and pushing ahead on
>>>> this, Aizhamal!
>>>>
>>>> I would also like to see the runner comparison revamp at
>>>> https://issues.apache.org/jira/browse/BEAM-2888 which would help users
>>>> really understand what they can and cannot do in plain terms.
>>>>
>>>> Kenn
>>>>
>>>> On Fri, Apr 5, 2019 at 9:30 AM Ahmet Altay  wrote:
>>>>
>>>>> Thank you Aizhamal for volunteering. I am happy to help as an
>>>>> administrator.
>>>>>
>>>>> cc: +Rose Nguyen  +Melissa Pashniak
>>>>>  in case they will be interested in mentorship
>>>>> and/or administration.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 5, 2019 at 9:16 AM Thomas Weise  wrote:
>>>>>
>>>>>> This is great. Beam documentation needs work in several areas, Python
>>>>>> SDK, portability and SQL come to mind right away :)
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 4:21 PM Aizhamal Nurmamat kyzy <
>>>>>> aizha...@google.com> wrote:
>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> As the ASF announced that each project can apply for Season of Docs
>>>>>>> individually, I would like to volunteer to be one of the administrators 
>>>>>>> for

Re: Projects Can Apply Individually for Google Season of Docs

2019-04-23 Thread Pablo Estrada
That's excellent. Thanks Max!
Let's hope we'll be selected.
Best
-P.

On Tue, Apr 23, 2019, 5:36 AM Maximilian Michels  wrote:

> Both proposals for doc improvements sound great. Portability is an
> obvious one and the capability matrix needs an update as well.
>
> I might be a bit late to the party, but I'd like to help with the
> mentoring. I've filled out the mentor form.
>
> Thanks,
> Max
>
> On 22.04.19 23:32, Pablo Estrada wrote:
> > Hello all,
> > thanks to everyone for your participation. I have submitted the
> > application on behalf of Beam, and requested one technical writer. Let's
> > see how it goes : )
> > Best
> > -P.
> >
> > On Wed, Apr 17, 2019 at 10:09 PM Ahmet Altay <mailto:al...@google.com> wrote:
> >
> > Thanks Aizhamal, I completed the forms.
> >
> > On Wed, Apr 17, 2019 at 6:46 PM Aizhamal Nurmamat kyzy
> > mailto:aizha...@google.com>> wrote:
> >
> > Hi everyone,
> >
> >
> > Here are a few updates on our application for Season of Docs:
> >
> >
> > 1. Pablo and I have created the following document [1] with some
> > of the project ideas shared in the mailing list. If you have
> > more ideas, please add them to the doc and provide a
> > description. If you also want to be a mentor for the proposed
> > ideas, please add your name to the table.
> >
> >
> > 2. To submit our application, we need to publish our project
> > ideas list. For this we have opened Jira tickets with the
> > “gsod2019” tag[2]. We should maybe also think of adding a small
> > blog post on the Beam website that contains all the ideas in one
> > place[3]? Please let me know what you think about this.
> >
> >
> > 3. By next week Tuesday (Application Deadline)
> >
> >   * +pabl...@apache.org <mailto:pabl...@apache.org>, please
> > complete the org application form [4]
> >
> >   * @Ahmet Altay <mailto:al...@google.com>, please complete the
> > alternative administrator form [5]
> >
> >   * @pabl...@apache.org <mailto:pabl...@apache.org>, @Ahmet
> > Altay <mailto:al...@google.com>, and all other
> > contributors that want to participate as mentors, please
> > complete the mentor registration form [6]
> >
> >
> > Thank you,
> >
> > Aizhamal
> >
> >
> > [1]
> >
> https://docs.google.com/document/d/1FNf-BjB4Q7PDdqygPboLr7CyIeo6JAkrt0RBgs2I4dE/edit#
> >
> > [2]
> >
> https://issues.apache.org/jira/browse/BEAM-7104?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20gsod2019
> >
> > [3] https://beam.apache.org/blog/
> >
> > [4]
> >
> https://docs.google.com/forms/d/e/1FAIpQLScrEq5yKmadgn7LEPC8nN811-6DNmYvus5uXv_JY5BX7CH-Bg/viewform
> >
> > [5]
> >
> https://docs.google.com/forms/d/e/1FAIpQLSc5ZsBzqfsib-epktZp8bYxL_hO4RhT_Zz8AY6zXDHB79ue9g/viewform
> >
> > [6]
> >
> https://docs.google.com/forms/d/e/1FAIpQLSe-JjGvaKKGWZOXxrorONhB8qN3mjPrB9ZVkcsntR73Cv_K7g/viewform
> >
> > On Wed, Apr 10, 2019 at 2:57 PM Pablo Estrada
> > mailto:pabl...@google.com>> wrote:
> >
> > I'd be happy to be a mentor for this to help add getting
> > started documentation for Python on Flink. I'd want to focus
> > on the reviews and less on the administration - so I'm
> > willing to be a secondary administrator if that's necessary
> > to move forward, but I'd love it if someone would help
> > administer.
> > FWIW, neither the administrator nor any other mentor has to
> > be a committer.
> >
> > Anyone willing to be primary administrator and also a mentor?
> >
> > Thanks
> > -P.
> >
> > On Fri, Apr 5, 2019 at 9:40 AM Kenneth Knowles
> > mailto:k...@apache.org>> wrote:
> >
> > Yes, this is great. Thanks for noticing the call and
> > pushing ahead on this, Aizhamal!
> >
> > I would also like to see the runner comparison revamp at
> > https

Re: Integration of python/portable runner tests for Samza runner

2019-04-23 Thread Pablo Estrada
This is cool, Daniel : ) Glad to see the Samza runner moving forward.
Best
-P.

On Tue, Apr 23, 2019 at 2:52 AM Maximilian Michels  wrote:

> Hi Daniel,
>
> Note that there is also Portable Validates Runner which runs Java
> portability tests. I don't know if you have integrated with that one
> already.
>
> Thanks,
> Max
>
> On 23.04.19 02:28, Ankur Goenka wrote:
> > Hi Daniel,
> >
> > We use flinkCompatibilityMatrix [1] to check the Flink compatibility
> > with Python. This is the Python equivalent of the validatesRunner tests
> > in Java for portable runners.
> > I think we can reuse it for the Samza portable runner with minor refactoring.
> >
> > [1]
> >
> https://github.com/apache/beam/blob/bdb1a713a120a887e71e85c77879dc4446a58541/sdks/python/build.gradle#L305
> >
> > On Mon, Apr 22, 2019 at 3:21 PM Daniel Chen wrote:
> >
> > Hi everyone,
> >
> > I'm working on improving the validation of the Python portable Samza
> > runner. For java, we have the gradle task ( :validatesRunner) that
> > runs the runner validation tests.
> > I am looking for pointers on how to similarly integrate/enable the
> > portability and Python tests for the Samza runner.
> >
> > Any help will be greatly appreciated.
> >
> > Thanks,
> > Daniel
> >
>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Pablo Estrada
I think this is a great idea. A way of doing it for Python would be by
using the test repository for PyPI[1]; that way we would not have to do
an official PyPI release, but would still be able to install it with pip
(by passing an extra flag such as --index-url https://test.pypi.org/simple/)
and test.

In fact, there are some Beam artifacts already in there[2]. At some point I
looked into this, but couldn't figure out who has access/the password for
it.

In short: +1, and I would suggest using the test PyPI repo to avoid
publishing to the main PyPI repo.
Best
-P.

[1] https://test.pypi.org/
[2] https://test.pypi.org/project/apache-beam/

On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:

> Hi all,
>
> What do you think about the idea of publishing pre-release artifacts as
> part of the RC emails?
>
> For Python this would translate into publishing the same artifacts from RC
> email with a version like "2.X.0rcY" to pypi. I do not know, but I am
> guessing we can do a similar thing with Maven central for Java artifacts as
> well.
>
> Advantages would be:
> - Allow end users to validate RCs for their own purposes using the same
> exact process they will normally use.
>  - Enable early adopters to start using RC releases early on in the
> release cycle if that is what they would like to do. This will in turn
> reduce time pressure on some releases, especially for cases where someone
> needs a release to be finalized for an upcoming event.
>
> There will also be disadvantages, some I could think of:
> - Users could request support for RC artifacts. Hopefully in the form of
> feedback for us to improve the release. But it could also be in the form of
> folks using RC artifacts for production for a long time.
> - It will add toil to the current release process, there will be one more
> step for each RC. I think for python this will be a small step but
> nevertheless it will be additional work.
>
> For an example of this, you can take a look at tensorflow releases. For
> 1.13 there were 3 pre-releases [1].
>
> Ahmet
>
> [1] https://pypi.org/project/tensorflow/#history
>


Add new JIRA component for Python IO?

2019-04-25 Thread Pablo Estrada
Hello all,
there are only two JIRA components for python: `sdk-py-core`, and
`sdk-py-harness`. Naturally, sdk-py-core is the component with the most
bugs (>1000).

I believe we really need a component to route issues for Python IO
(`io-python-all`?). Maybe even something more granular.

Thoughts?
-P.


Re: Add new JIRA component for Python IO?

2019-04-25 Thread Pablo Estrada
Thanks Kenn! <3

On Thu, Apr 25, 2019 at 2:07 PM Kenneth Knowles  wrote:

> Makes sense.
>
> I just added components for all the things I identified
> under sdks/python/apache_beam/io
>
> Kenn
>
> On Thu, Apr 25, 2019 at 12:43 PM Pablo Estrada  wrote:
>
>> Hello all,
>> there are only two JIRA components for python: `sdk-py-core`, and
>> `sdk-py-harness`. Naturally, sdk-py-core is the component with the most
>> bugs (>1000).
>>
>> I believe we really need a component to route issues for Python IO
>> (`io-python-all`?). Maybe even something more granular.
>>
>> Thoughts?
>> -P.
>>
>>
>>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Pablo Estrada
Hm no, these are somewhat common. Yes, I think we could have retries to try
to fix this sort of problem.

Perhaps a mixture of reusing a virtualenv, and having retries when creating
it?

On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:

> Okay, but this occurred on Jenkins. So does the machine need an update?
>
> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev 
> wrote:
>
>> I think you hit a pypi flake.
>>
>> pip install futures>=2.2.0 works fine for me.
>>
>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>>
>>> Would be nice to fix this as it can slow down PRs. I am not sure if this 
>>> one is fixed on retry yet or not.
>>>
>>>
>>>
>>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>>>
>>>
>>> 18:05:44 > Task :beam-sdks-python-test-suites-dataflow:setupVirtualenv
>>> 18:05:44 New python executable in /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python2.7
>>> 18:05:44 Also creating executable in /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python
>>> 18:05:44 Installing setuptools, pkg_resources, pip, wheel...done.
>>> 18:05:44 Running virtualenv with interpreter /usr/bin/python2.7
>>> 18:05:44 DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
>>> 18:05:44 Collecting tox==3.0.0
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl
>>> 18:05:44 Collecting grpcio-tools==1.3.5
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl
>>> 18:05:44 Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/84/e8/4ddac125b5a0e84ea6ffc93cfccf1e7ee1924e88f53c64e98227f0af2a5f/pluggy-0.9.0-py2.py3-none-any.whl
>>> 18:05:44 Collecting six (from tox==3.0.0)
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
>>> 18:05:44 Collecting virtualenv>=1.11.2 (from tox==3.0.0)
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/4f/ba/6f9315180501d5ac3e707f19fcb1764c26cc6a9a31af05778f7c2383eadb/virtualenv-16.5.0-py2.py3-none-any.whl
>>> 18:05:44 Collecting py>=1.4.17 (from tox==3.0.0)
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/76/bc/394ad449851729244a97857ee14d7cba61ddb268dce3db538ba2f2ba1f0f/py-1.8.0-py2.py3-none-any.whl
>>> 18:05:44 Collecting grpcio>=1.3.5 (from grpcio-tools==1.3.5)
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/7c/59/4da8df60a74f4af73ede9d92a75ca85c94bc2a109d5f67061496e8d496b2/grpcio-1.20.0-cp27-cp27mu-manylinux1_x86_64.whl
>>> 18:05:44 Collecting protobuf>=3.2.0 (from grpcio-tools==1.3.5)
>>> 18:05:44   Using cached https://files.pythonhosted.org/packages/ea/72/5eadea03b06ca1320be2433ef2236155da17806b700efc92677ee99ae119/protobuf-3.7.1-cp27-cp27mu-manylinux1_x86_64.whl
>>> 18:05:44 Collecting futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5)
>>> 18:05:44   ERROR: Could not find a version that satisfies the requirement futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5) (from versions: none)
>>> 18:05:44 ERROR: No matching distribution found for futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5)
>>> 18:05:46
>>> 18:05:46 > Task :beam-sdks-python-test-suites-dataflow:setupVirtualenv FAILED
>>>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Pablo Estrada
Good points, Kenn : ) - I know Yifan looked into running tests using
Docker. I imagine a testing container could have a set of dependencies
installed, and reduce the number of installations we perform from PyPI.
Best
-P.

On Fri, Apr 26, 2019 at 11:49 AM Kenneth Knowles  wrote:

> Flakiness in Java got a lot better when we put the Maven cache outside the
> wiped build directory. I am not sure about Gradle now... It is obviously
> less hermetic, but these things should be immutable so a cache is
> acceptable.
>
> Is there a way to achieve this for Python? For Maven/Gradle a package
> being downloaded does not imply it is available for import. If downloaded
> == importable then you risk masking errors.
>
> Kenn
>
> On Fri, Apr 26, 2019 at 11:44 AM Udi Meiri  wrote:
>
>> That runs the risk of overloading our test machines when the build goes
>> red.
>>
>> On Fri, Apr 26, 2019 at 11:29 AM Alex Amato  wrote:
>>
>>> It would be ideal to not need manual steps. If known flakey tests can be
>>> auto retried that would be a great improvement.
>>>
>>> On Fri, Apr 26, 2019 at 11:24 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
>>>> We could do something along the lines of retry with a back-off. Note
>>>> that Java tests also have this problem as we sometimes fail to fetch
>>>> packages from Maven Central.
>>>>
>>>> On Fri, Apr 26, 2019 at 11:19 AM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> Hm no, these are somewhat common. Yes, I think we could have retries
>>>>> to try to fix this sort of problem.
>>>>>
>>>>> Perhaps a mixture of reusing a virtualenv, and having retries when
>>>>> creating it?
>>>>>
>>>>> On Fri, Apr 26, 2019 at 11:15 AM Alex Amato 
>>>>> wrote:
>>>>>
>>>>>> Okay, but this occurred on Jenkins. So does the machine need an update?
>>>>>>
>>>>>> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev <
>>>>>> valen...@google.com> wrote:
>>>>>>
>>>>>>> I think you hit a pypi flake.
>>>>>>>
>>>>>>> pip install futures>=2.2.0 works fine for me.
>>>>>>>
>>>>>>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Would be nice to fix this as it can slow down PRs. I am not sure if 
>>>>>>>> this one is fixed on retry yet or not.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>>>>>>>
>>>>>>>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull

[discuss] A tweak to the Python API for SDF?

2019-04-26 Thread Pablo Estrada
Hi all,
Sorry about the wall of text.
So, first of all, I thought about this while reviewing a PR by Boyuan with
an example of an SDF[1]. This is very exciting btw : ).

Anyway... I certainly have a limited view of the whole SDF effort, but I
think it's worth discussing this particular point about the API before
finalizing SDF and making it widely available. So here I go:

The Python API for SDF asks users to provide a restriction provider in
their process function signature. More or less the following:

class MyOwnLittleSDF(beam.DoFn):
  def process(self, element,
              restriction_tracker=MyOwnLittleRestrictionProvider()):
    # My DoFn logic...

This is all fine, but something that I found a little odd is that the
restriction provider gets replaced at runtime with a restriction tracker:

class MyOwnLittleSDF(beam.DoFn):
  def process(self, element,
              restriction_tracker=MyOwnLittleRestrictionProvider()):
    # This assert succeeds : )
    assert not isinstance(restriction_tracker,
                          MyOwnLittleRestrictionProvider)

After thinking a little bit about it, I realized that the default argument
simply allows us to inform the runner where to find the restriction
provider; but that the thing that we need at runtime is NOT the restriction
provider - but rather, the restriction tracker.

A similar pattern occurs with state and timers, where the runner needs to
know the sort of state, the coder for the values in that state (or the time
domain for timers); but the runtime parameter is different[2]. For state
and timers (and window, timestamp, pane, etc.) we provide a pattern where
users give a default value that is clearly a placeholder:
beam.DoFn.TimerParam, or beam.DoFn.StateParam.
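
For reference, that pattern looks roughly like this (a from-memory sketch;
the linked tests[2] are the authoritative example):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec

class IndexAssigningDoFn(beam.DoFn):
  INDEX_STATE = CombiningValueStateSpec('index', VarIntCoder(), sum)

  def process(self, element,
              index=beam.DoFn.StateParam(INDEX_STATE)):
    # At runtime `index` is a state cell, not the spec used as the default.
    current_index = index.read()
    index.add(1)
    yield element, current_index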

In this case, the API is fairly similar, but (at least in my imagination),
it is much clearer about how the DoFnParam will be replaced with
something else at runtime. A similar change could be done for SDF:

class MyOwnLittleSDF(beam.DoFn):
  MY_RESTRICTION = \
      RestrictionSpec(provider=MyOwnLittleRestrictionProvider())

  def process(
      self, element,
      restriction_tracker=beam.DoFn.RestrictionParam(MY_RESTRICTION)):
    # My DoFn logic..

Perhaps it is a good opportunity to consider this, since SDF is still in
progress.

Some pros:
- Consistent with other parameters that we pass to DoFn methods
- A bit more clear about what will happen at runtime

Some cons:
- SDF developers are "power users", and will have gone through the SDF
documentation. This point will be clear to them.
- This may create unnecessary work, and perhaps unintended consequences.
- I bet there's more

Thoughts?

-P.

[1] https://github.com/apache/beam/pull/8338
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/userstate_test.py#L560-L586
 .


Re: [discuss] A tweak to the Python API for SDF?

2019-04-29 Thread Pablo Estrada
Thanks all,
 @Luke - I imagine that would be an improvement to the API, but this may be
harder as this is already available to users, and there are those who have
implemented SDFs under the current API. Would it be possible to make a
backwards-compatible change to the API here?

For the Python changes, I've proposed a pull request:
https://github.com/apache/beam/pull/8430 - it was smaller than I thought :
) - All comments welcome please.

+Boyuan Zhang  I am happy to wait for your
SyntheticSource PR to be merged and make the appropriate changes if you'd
like.
Best
-P.

On Mon, Apr 29, 2019 at 8:23 AM Lukasz Cwik  wrote:

> Would it make sense to also do this in the Java SDK?
>
> That would make the restriction provider also mirror the TimerSpec and
> StateSpec, which use annotations, similar to how it's done in Python.
>
> On Mon, Apr 29, 2019 at 3:42 AM Robert Bradshaw 
> wrote:
>
>> +1 to introducing this Param for consistency (and making the
>> substitution more obvious), and I think SDF is still new/experimental
>> enough we can do this. I don't know if we need Spec in addition to
>> Param and Provider.
>>
>> On Sat, Apr 27, 2019 at 1:07 AM Chamikara Jayalath 
>> wrote:
>> >
>> >
>> >
>> > On Fri, Apr 26, 2019 at 3:43 PM Pablo Estrada 
>> wrote:
>> >>
>> >> Hi all,
>> >> Sorry about the wall of text.
>> >> So, first of all, I thought about this while reviewing a PR by Boyuan
>> with an example of an SDF[1]. This is very exciting btw : ).
>> >>
>> >> Anyway... I certainly have a limited view of the whole SDF effort, but
>> I think it's worth discussing this particular point about the API before
>> finalizing SDF and making it widely available. So here I go:
>> >>
>> >> The Python API for SDF asks users to provide a restriction provider in
>> their process function signature. More or less the following:
>> >>
>> >> class MyOwnLittleSDF(beam.DoFn):
>> >>   def process(self, element,
>> >>               restriction_tracker=MyOwnLittleRestrictionProvider()):
>> >>     # My DoFn logic...
>> >>
>> >> This is all fine, but something that I found a little odd is that the
>> restriction provider gets replaced at runtime with a restriction tracker:
>> >>
>> >> class MyOwnLittleSDF(beam.DoFn):
>> >>   def process(self, element,
>> >>               restriction_tracker=MyOwnLittleRestrictionProvider()):
>> >>     # This assert succeeds : )
>> >>     assert not isinstance(restriction_tracker,
>> >>                           MyOwnLittleRestrictionProvider)
>> >>
>> >> After thinking a little bit about it, I realized that the default
>> argument simply allows us to inform the runner where to find the
>> restriction provider; but that the thing that we need at runtime is NOT the
>> restriction provider - but rather, the restriction tracker.
>> >>
>> >> A similar pattern occurs with state and timers, where the runner needs
>> to know the sort of state, the coder for the values in that state (or the
>> time domain for timers); but the runtime parameter is different[2]. For
>> state and timers (and window, timestamp, pane, etc.) we provide a pattern
>> where users give a default value that is clearly a placeholder:
>> beam.DoFn.TimerParam, or beam.DoFn.StateParam.
>> >
>> >
>> > This is the way (new) DoFns work in the Python SDK. The SDK (harness)
>> identifies the meanings of different (potential) arguments to a DoFn using
>> pre-defined default values.
>> >
>> >>
>> >>
>> >> In this case, the API is fairly similar, but (at least in my
>> imagination), it is much more clear about how the DoFnParam will be
>> replaced with something else at runtime. A similar change could be done for
>> SDF:
>> >>
>> >> class MyOwnLittleSDF(beam.DoFn):
>> >>   MY_RESTRICTION = \
>> >>       RestrictionSpec(provider=MyOwnLittleRestrictionProvider())
>> >>
>> >>   def process(
>> >>       self, element,
>> >>       restriction_tracker=beam.DoFn.RestrictionParam(MY_RESTRICTION)):
>> >>     # My DoFn logic..
>> >
>> >
>> >
>> > If I understood correctly, what you propose is similar to the existing
>> solution, but we add an XXXParam parameter for consistency?
>> > I think this is fine and should be a relatively small change. Main
>> point is, SDK should be able to find out the Restrictio

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-29 Thread Pablo Estrada
 at 6:15 PM Michael Luckey 
>>>> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> wouldn't that be in conflict with Apache release policy [1] ?
>>>> >>>
>>>> >>> [1] http://www.apache.org/legal/release-policy.html
>>>> >>>
>>>> >>> On Thu, Apr 25, 2019 at 1:35 AM Alan Myrvold 
>>>> wrote:
>>>> >>>>
>>>> >>>> Great idea. I like the RC candidates to follow as much as the
>>>> release artifact process as possible.
>>>> >>>>
>>>> >>>> On Wed, Apr 24, 2019 at 3:27 PM Ahmet Altay 
>>>> wrote:
>>>> >>>>>
>>>> >>>>> To clarify my proposal, I am proposing publishing to the
>>>> production pypi repository with an rc tag in the version. And in turn allow
>>>> users to depend on beam's rc version + all the other regular dependencies
>>>> users would have directly from pypi.
>>>> >>>>>
>>>> >>>>> Publishing to test pypi repo would also be helpful if test pypi
>>>> repo also mirrors other packages that exist in the production pypi
>>>> repository.
>>>> >>>>>
>>>> >>>>> On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada 
>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> I think this is a great idea. A way of doing it for Python would
>>>> be by using the test repository for PyPI[1]; that way we would not have
>>>> to do an official PyPI release, but would still be able to install it with
>>>> pip (by passing an extra flag such as --index-url
>>>> https://test.pypi.org/simple/) and test.
>>>> >>>>>>
>>>> >>>>>> In fact, there are some Beam artifacts already in there[2]. At
>>>> some point I looked into this, but couldn't figure out who has access/the
>>>> password for it.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> I also don't know who owns beam package in test pypi repo. Does
>>>> anybody know?
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> In short: +1, and I would suggest using the test PyPi repo to
>>>> avoid publishing to the main PyPi repo.
>>>> >>>>>> Best
>>>> >>>>>> -P.
>>>> >>>>>>
>>>> >>>>>> [1] https://test.pypi.org/
>>>> >>>>>> [2] https://test.pypi.org/project/apache-beam/
>>>> >>>>>>
>>>> >>>>>> On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay 
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi all,
>>>> >>>>>>>
>>>> >>>>>>> What do you think about the idea of publishing pre-release
>>>> artifacts as part of the RC emails?
>>>> >>>>>>>
>>>> >>>>>>> For Python this would translate into publishing the same
>>>> artifacts from RC email with a version like "2.X.0rcY" to pypi. I do not
>>>> know, but I am guessing we can do a similar thing with Maven central for
>>>> Java artifacts as well.
>>>> >>>>>>>
>>>> >>>>>>> Advantages would be:
>>>> >>>>>>> - Allow end users to validate RCs for their own purposes using
>>>> the same exact process they will normally use.
>>>> >>>>>>>   - Enable early adopters to start using RC releases early on
>>>> in the release cycle if that is what they would like to do. This will in
>>>> turn reduce time pressure on some releases, especially for cases where
>>>> someone needs a release to be finalized for an upcoming event.
>>>> >>>>>>>
>>>> >>>>>>> There will also be disadvantages, some I could think of:
>>>> >>>>>>> - Users could request support for RC artifacts. Hopefully in
>>>> the form of feedback for us to improve the release. But it could also be in
>>>> the form of folks using RC artifacts for production for a long time.
>>>> >>>>>>> - It will add toil to the current release process, there will
>>>> be one more step for each RC. I think for python this will be a small step
>>>> but nevertheless it will be additional work.
>>>> >>>>>>>
>>>> >>>>>>> For an example of this, you can take a look at tensorflow
>>>> releases. For 1.13 there were 3 pre-releases [1].
>>>> >>>>>>>
>>>> >>>>>>> Ahmet
>>>> >>>>>>>
>>>> >>>>>>> [1] https://pypi.org/project/tensorflow/#history
>>>>
>>>


Re: [DISCUSS] (Forked thread) Beam issue triage & assignees

2019-05-01 Thread Pablo Estrada
Hi Kenn,
For my information... is the Needs Triage status automatically assigned to
new issues? Are users expected to give their issue the Needs Triage status
when they create it?
Thanks
-P.

On Wed, May 1, 2019 at 11:12 AM Kenneth Knowles  wrote:

> An update here: we have the new workflow in place. I have transitioned
> untriaged issues to the "Needs Triage" status so they are very easy to
> find, and removed the obsolete triaged label.
>
> Please help to triage! You can just look at all issues with the Needs
> Triage status, make sure each one is in the right component and its priority
> makes sense, and maybe alert someone who might want to know about it.
>
> Kenn
>
> On Mon, Mar 4, 2019 at 9:23 AM Kenneth Knowles  wrote:
>
>> This effort to improve our triage is still ongoing. To recall:
>>
>> Issues are no longer automatically assigned, so we have to watch them!
>>
>> Here's a saved search for issues needing triage:
>> https://issues.apache.org/jira/issues/?filter=12345682
>>
>> Anyone can help out. Just make sure the issue is in a suitable component
>> and someone is assigned or mentioned so they'll get a notification, then
>> add the "triaged" tag.
>>
>> You can also subscribe to the filter to watch incoming issues.
>>
>> Kenn
>>
>> On Wed, Feb 6, 2019 at 9:04 PM Kenneth Knowles  wrote:
>>
>>> I re-triaged most issues where the creation date != last update. I
>>> worked through everyone with more issues than myself (which I have triaged
>>> regularly) and a few people with a few fewer issues.
>>>
>>> I didn't look as closely at issues that were filed by the assignee. So
>>> if you filed a bunch of issues that landed on yourself, take a look.
>>>
>>> If you have fewer than 30 issues assigned to you, please take a look at
>>> them now.
>>>
>>> Kenn
>>>
>>> On Wed, Feb 6, 2019 at 8:15 PM Kenneth Knowles  wrote:
>>>
 While we work with infra on this, let's remove the broken system and
 use tags. It is important that issues coming in are known to be untriaged,
 so instead of a "Needs Triage" label, we should use "triaged". So I will
 take these actions that everyone seems to agree on:

  - Remove default assignment from Jira configs
  - Unassign all issues from people with a huge number
  - Add "triaged" tag to issues that are assigned and have some
 meaningful recent activity

 I will use trial-and-error to figure out what looks OK for "huge
 number" and "meaningful recent activity".

 Kenn

 On Fri, Jan 11, 2019 at 3:20 PM Kenneth Knowles 
 wrote:

> Filed https://issues.apache.org/jira/browse/INFRA-17628 for the new
> status. The rest of 1-3 is self-service I think. I expect step 4 and 5 
> will
> need INFRA as well, but I/we should do what we can to make a very clear
> request.
>
> On Fri, Jan 11, 2019 at 12:54 PM Kenneth Knowles 
> wrote:
>
>> It sounds like there's a lot of consensus, pretty much on the action
>> items that Max and Ahmet suggested. I will start on these first steps if 
>> no
>> one objects:
>>
>> 0) Add a Needs Review status to our workflow
>> 1) Change new issues to be Unassigned and to be in status "Needs
>> Review"
>> 2) Unassign all issues from folks with > 30
>>
>> And I'm not sure if folks had more to say on these:
>>
>> 3) Use Wiki of multiple committers per component rather than Jira
>> component owners
>> 4) Automatically unassign stale issues that are just sitting on an
>> assignee
>> 5) Look into SLOs per issue priority and see how we can surface SLO
>> violations (reports and pings)
>>
>> Kenn
>>
>> On Thu, Jan 10, 2019 at 11:41 AM Scott Wegner 
>> wrote:
>>
>>> +1
>>>
>>> > 3) Ensure that each component's unresolved issues get looked at
>>> regularly
>>>
>>> This is ideal, but I also don't know how to get to this state.
>>> Starting with clear component ownership and expectations will help. If 
>>> the
>>> triaging process is well-defined, then members of the community can help
>>> for any components which need additional support.
>>>
>>> On Thu, Jan 10, 2019 at 12:21 AM Mikhail Gryzykhin <
>>> gryzykhin.mikh...@gmail.com> wrote:
>>>
 +1 to keep issues unassigned and reevaluate backlog from time to
 time.

 We can also auto-unassign if there was no activity on a ticket for N
 days. Or we can have an auto-mailed report that highlights stale assigned
 issues.

 On Thu, Jan 10, 2019 at 12:10 AM Robert Bradshaw <
 rober...@google.com> wrote:

> On Thu, Jan 10, 2019 at 3:20 AM Ahmet Altay 
> wrote:
> >
> > I agree with the proposals here. Initial state of "Needs Review"
> and blocking releases on untriaged issues will ensure that we will at 
> least
> look at every new issue once.
>

Fwd: Your application for Season of Docs 2019 was unsuccessful

2019-05-01 Thread Pablo Estrada
Hello all,
as you may already know, unfortunately our application for Season of Docs
was not successful. That's too bad : ) - but it's good that we were able to
produce a couple of work items that can still be picked up by the community at
some point. Thanks to everyone who helped here.
Best
-P.

-- Forwarded message -
From: Andrew Chen 
Date: Tue, Apr 30, 2019 at 5:31 AM
Subject: Your application for Season of Docs 2019 was unsuccessful
To: 


Thank you for your interest and enthusiasm for Season of Docs.

We’re sorry to say that your organization’s application for Season of Docs
was not successful this year. Because 2019 is the program’s pilot year, we
were only able to accept 50 organizations out of the almost 200
applications submitted. There were many high quality applications, so the
selection of organizations was not easy.

Please do stay in touch with the progress of Season of Docs, and let us
know if you have any questions or feedback by emailing
season-of-docs-supp...@googlegroups.com. We are hoping to grow the
program's capacity in subsequent years, so please apply again next year.

Sincerely,

The Season of Docs team


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Pablo Estrada
Thanks Udi and congrats!

On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer:
> Udi Meiri.
>
> Udi has been contributing to Beam since late 2017, starting with HDFS
> support in the Python SDK and continuing with a ton of Python work. I will
> also highlight his work on community-building infrastructure, including
> documentation, experiments with ways to find reviewers for pull requests,
> gradle build work, analyzing and reducing build times.
>
> In consideration of Udi's contributions, the Beam PMC trusts Udi with the
> responsibilities of a Beam committer [1].
>
> Thank you, Udi, for your contributions.
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-
> committer
>


Beam at Google Summer of Code 2019

2019-05-06 Thread Pablo Estrada
Hello all,
it is my pleasure to share with everyone that Tanay Tummalapalli has been
accepted as a GSoC student with Beam, to implement support for File Loads
into BigQuery for streaming pipelines[1].

Tanay wrote a very strong proposal, and showed understanding of the tricky
streaming considerations that will play out in this project.

I speak on behalf of everyone in welcoming you, Tanay, and we'll be happy to
see your contributions to Beam. : )
Best
-P.

[1]
https://summerofcode.withgoogle.com/projects/?sp-search=Tanay#4999837794172928


Re: [discuss] A tweak to the Python API for SDF?

2019-05-08 Thread Pablo Estrada
Hello all,
The API has been updated for Python (See
https://github.com/apache/beam/pull/8430). Please, if you catch any
documentation that needs updating, flag to me or just propose the change : )
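
For reference, a minimal sketch of what a DoFn looks like under the updated
API - this assumes the RestrictionParam wrapper from that PR, and the
provider below is a made-up example:

import apache_beam as beam
from apache_beam.io.restriction_trackers import (
    OffsetRange, OffsetRestrictionTracker)
from apache_beam.transforms.core import RestrictionProvider


class MyRestrictionProvider(RestrictionProvider):  # hypothetical provider
  def initial_restriction(self, element):
    return OffsetRange(0, len(element))

  def create_tracker(self, restriction):
    return OffsetRestrictionTracker(restriction)


class MyLittleSDF(beam.DoFn):
  def process(
      self,
      element,
      # The provider is now wrapped in a DoFn param, which makes the
      # runtime substitution explicit: at execution time this argument
      # is the RestrictionTracker, not the provider declared here.
      restriction_tracker=beam.DoFn.RestrictionParam(
          MyRestrictionProvider())):
    # My DoFn logic...
    yield element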

As for Java - we didn't end up determining whether it makes sense to update
the API as well. Thoughts from others?

In any case, I've filed https://jira.apache.org/jira/browse/BEAM-7250 to
track this for Java.

Best
-P.

On Mon, Apr 29, 2019 at 2:41 PM Lukasz Cwik  wrote:

> Pablo, all the SplittableDoFn stuff is marked as @Experimental so one is
> able to change it. There really is only one complicated one to change in
> Watch.java, the rest are quite straightforward.
>
> On Mon, Apr 29, 2019 at 2:23 PM Pablo Estrada  wrote:
>
>> Thanks all,
>>  @Luke - I imagine that would be an improvement to the API, but this may
>> be harder as this is already available to users, and there are those who
>> have implemented SDFs under the current API. Would it be possible to make a
>> backwards-compatible change to the API here?
>>
>> For the Python changes, I've proposed a pull request:
>> https://github.com/apache/beam/pull/8430 - it was smaller than I thought
>> : ) - All comments welcome please.
>>
>> +Boyuan Zhang  I am happy to wait for your
>> SyntheticSource PR to be merged and make the appropriate changes if you'd
>> like.
>> Best
>> -P.
>>
>> On Mon, Apr 29, 2019 at 8:23 AM Lukasz Cwik  wrote:
>>
>>> Would it make sense to also do this in the Java SDK?
>>>
>>> The would make the restriction provider also mirror the TimerSpec and
>>> StateSpec which use annotations similar to how its done in Python.
>>>
>>> On Mon, Apr 29, 2019 at 3:42 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> +1 to introducing this Param for consistency (and making the
>>>> substitution more obvious), and I think SDF is still new/experimental
>>>> enough we can do this. I don't know if we need Spec in addition to
>>>> Param and Provider.
>>>>
>>>> On Sat, Apr 27, 2019 at 1:07 AM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Apr 26, 2019 at 3:43 PM Pablo Estrada 
>>>> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >> Sorry about the wall of text.
>>>> >> So, first of all, I thought about this while reviewing a PR by
>>>> Boyuan with an example of an SDF[1]. This is very exciting btw : ).
>>>> >>
>>>> >> Anyway... I certainly have a limited view of the whole SDF effort,
>>>> but I think it's worth discussing this particular point about the API
>>>> before finalizing SDF and making it widely available. So here I go:
>>>> >>
>>>> >> The Python API for SDF asks users to provide a restriction provider
>>>> in their process function signature. More or less the following:
>>>> >>
>>>> >> class MyOwnLittleSDF(beam.DoFn):
>>>> >>   def process(self, element,
>>>> >>       restriction_tracker=MyOwnLittleRestrictionProvider()):
>>>> >>     # My DoFn logic...
>>>> >>
>>>> >> This is all fine, but something that I found a little odd is that
>>>> the restriction provider gets replaced at runtime with a restriction
>>>> tracker:
>>>> >>
>>>> >> class MyOwnLittleSDF(beam.DoFn):
>>>> >>   def process(self, element,
>>>> >>       restriction_tracker=MyOwnLittleRestrictionProvider()):
>>>> >>     # This assert succeeds : )
>>>> >>     assert not isinstance(restriction_tracker,
>>>> >>                           MyOwnLittleRestrictionProvider)
>>>> >>
>>>> >> After thinking a little bit about it, I realized that the default
>>>> argument simply allows us to inform the runner where to find the
>>>> restriction provider; but that the thing that we need at runtime is NOT the
>>>> restriction provider - but rather, the restriction tracker.
>>>> >>
>>>> >> A similar pattern occurs with state and timers, where the runner
>>>> needs to know the sort of state, the coder for the values in that state (or
>>>> the time domain for timers); but the runtime parameter is different[2]. For
>>>> state and timers (and window, timestamp, pane, etc

[discuss] Reducing log verbosity for Python failures?

2019-05-08 Thread Pablo Estrada
Hello all,
Some tests in Python have the problem that when they fail, lots of internal
logging is dumped onto stdout, and we end up having to scroll way up to
find the actual stack trace for the failed test. This logging, as far as I
can tell, is a dump of Fn API protos.

Does anyone use these logs to look into the test failure? I would like to
find a way to make these more compact, or maybe just stop logging them
(people who need them can choose to log them in their local setup?).

I lean towards making them more compact (by, for instance, writing
functions that log their information in a more compact fashion); but I
would like to hear thoughts from others.
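
As a purely illustrative example of what "more compact" could mean - a
helper along these lines (name and truncation limit are made up):

import logging


def log_proto_compactly(proto, max_len=200):
  # Render a (potentially huge) Fn API proto on a single line and
  # truncate it, instead of dumping the full multi-line text proto.
  text = str(proto).replace('\n', ' ')
  suffix = '...' if len(text) > max_len else ''
  logging.getLogger(__name__).debug('%s%s', text[:max_len], suffix)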

So thoughts? : )
-P.


Re: [discuss] Reducing log verbosity for Python failures?

2019-05-08 Thread Pablo Estrada
My impression - and I might be wrong [see meme at 1] - is that these are
logged in debug mode, but when the test fails, the test runner dumps them
all to stdout.
Best
-P.

[1]
https://i1.wp.com/gifrific.com/wp-content/uploads/2015/02/Chris-Farley-Oh-God-Theyre-Gonna-Know-Im-Dumb-Conan-Interview.gif?ssl=1

*From: *Ahmet Altay 
*Date: *Wed, May 8, 2019 at 3:13 PM
*To: *dev

+1 It is hard to debug with lots logs messages. And if anybody is using
> them for development we can make those logs debug logs and hide them by
> default.
>
> *From: *Robert Bradshaw 
> *Date: *Wed, May 8, 2019 at 3:01 PM
> *To: *dev
>
> +1 to making them significantly more compact in most cases.
>>
>> From: Pablo Estrada 
>> Date: Wed, May 8, 2019 at 11:35 PM
>> To: dev
>>
>> > Hello all,
>> > Some tests in Python have the problem that when they fail, lots of
>> internal logging is dumped onto stdout, and we end up having to scroll way
>> up to find the actual stack trace for the failed test. This logging, as far
>> as I can tell, is a dump of Fn API protos.
>> >
>> > Does anyone use these logs to look into the test failure? I would like
>> to find a way to make these more compact, or maybe just stop logging them
>> (people who need them can choose to log them in their local setup?).
>> >
>> > I lean towards making them more compact (by, for instance, writing
>> functions that log their information in a more compact fashion); but I
>> would like to hear thoughts from others.
>> >
>> > So thoughts? : )
>> > -P.
>>
>


Re: Contributing Beam Kata (Java & Python)

2019-05-14 Thread Pablo Estrada
+1 on merging.

*From: *Reuven Lax 
*Date: *Tue, May 14, 2019 at 1:23 PM
*To: *dev
*Cc: *Lars Francke

I've been playing around with this the past day or two, and it's great!
> I'm inclined to merge this PR (if nobody objects) so that others in the
> community can contribute more training katas.
>
> Reuven
>
> *From: *Ismaël Mejía 
> *Date: *Tue, Apr 23, 2019 at 6:43 AM
> *To: *Lars Francke
> *Cc: * 
>
> Thanks for answering Lars,
>>
>> The 'interesting' part is that the tutorial has a full IDE integrated
>> experience based on the Jetbrains edu platform [1]. So maybe
>> interesting to see if it could make sense to have projects like this
>> in the new trainings incubator project or if they became too platform
>> constrained.
>>
>> This contribution is valuable for Beam but the community may decide
>> that it makes sense for it to live at some moment at the trainings
>> project. I suppose also Henry could be interested in taking a look at
>> this [2].
>>
>> [1] https://www.jetbrains.com/education/
>> [2] https://incubator.apache.org/clutch/training.html
>>
>> On Tue, Apr 23, 2019 at 3:00 PM Lars Francke 
>> wrote:
>> >
>> > Thanks Ismaël.
>> >
>> > I must admit I'm a tad confused. What has JetBrains got to do with this?
>> > This looks pretty cool and specific to Beam though, or is this more
>> generic?
>> > But yeah something along those lines could be interesting for hands-on
>> type things in training.
>> >
>> > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía 
>> wrote:
>> >>
>> >> +lars.fran...@gmail.com who is in the Apache training project and may
>> >> be interested in this one or at least the JetBrains like approach.
>> >>
>> >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía 
>> wrote:
>> >> >
>> >> > This looks great, nice for bringing this to the project Henry!
>> >> >
>> >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
>> >> >  wrote:
>> >> > >
>> >> > > Thanks Altay.
>> >> > > I'll create it under "learning/" first as this is not exactly
>> example.
>> >> > > Please do let me know if it's not the right place.
>> >> > >
>> >> > > On 2019/04/18 22:49:47, Ahmet Altay  wrote:
>> >> > > > This looks great.
>> >> > > >
>> >> > > > +David Cavazos  was working on interactive
>> colab based
>> >> > > > examples (https://github.com/apache/beam/pull/7679) perhaps we
>> can have a
>> >> > > > shared place for these two similar things.
>> >> > > >
>> >> > >
>>
>


Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-14 Thread Pablo Estrada
+1

*From: *Lukasz Cwik 
*Date: *Tue, May 14, 2019 at 2:20 PM
*To: *dev

+1
>
> *From: *Daniel Oliveira 
> *Date: *Tue, May 14, 2019 at 2:19 PM
> *To: *dev
>
> Hello everyone,
>>
>> I'm calling for a vote on removing the deprecated Java Reference Runner
>> code. The PR for the change has already been tested and reviewed:
>> https://github.com/apache/beam/pull/8380
>>
>> [ ] +1, Approve merging the removal PR in it's current state
>> [ ] -1, Veto the removal PR (please provide specific comments)
>>
>> The vote will be open for at least 72 hours. Since this a vote on
>> code-modification, it is adopted if there are at least 3 PMC affirmative
>> votes and no vetoes.
>>
>> For those who would like context on why the Java Reference Runner is
>> being deprecated, the discussions took place in the following email threads:
>>
>>1. (8 Feb. 2019) Thoughts on a reference runner to invest in?
>>
>> 
>>  -
>>Decision to deprecate the Java Reference Runner and use the Python
>>FnApiRunner for those use cases instead.
>>2. (14 Mar. 2019) Python PVR Reference post-commit tests failing
>>
>> 
>>- Removal of Reference Runner Post-Commits from Jenkins, and discussion on
>>removal of code.
>>3. (25 Apr. 2019) Removing Java Reference Runner code
>>
>> 
>>- Discussion thread before this formal vote.
>>
>>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Pablo Estrada
Thanks everyone for the encouragement, and thanks to the PMC for the
recognition. I am honored and grateful. :)
Best
-P.


*From: *Kenneth Knowles 
*Date: *Tue, May 14, 2019, 10:25 PM
*To: *dev

Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming Pablo Estrada to
> join the PMC.
>
> Pablo first picked up BEAM-722 in October of 2016 and has been a steady
> part of the Beam community since then. In addition to technical work on
> Beam Python & Java & runners, I would highlight how Pablo grows Beam's
> community by helping users, working on GSoC, giving talks at Beam Summits
> and other OSS conferences including Flink Forward, and holding training
> workshops. I cannot do justice to Pablo's contributions in a single
> paragraph.
>
> Thanks Pablo, for being a part of Beam.
>
> Kenn
>


Re: Contributing Beam Kata (Java & Python)

2019-05-15 Thread Pablo Estrada
I think a blog post is a great idea. Would you be able to write one?
See here for an example of a PR adding one:
https://github.com/apache/beam/pull/8391

Best
-P.

*From: *hsuryawira...@google.com 
*Date: *Wed, May 15, 2019, 12:51 AM
*To: * 

Thanks for merging it Reuven!
>
> Quick question: would it be useful if we wrote a blog post on the Kata so
> that we can build more awareness and get people to try it out?
> I've also uploaded the course to Stepik which has seamless integration
> within the IDE for people to easily start the course.
>
> On 2019/05/14 20:35:07, Reuven Lax  wrote:
> > Merged
> >
> > *From: *Reza Rokni 
> > *Date: *Tue, May 14, 2019 at 1:29 PM
> > *To: * 
> > *Cc: *Lars Francke
> >
> > +1 :-)
> > >
> > > *From: *Lukasz Cwik 
> > > *Date: *Wed, 15 May 2019 at 04:29
> > > *To: *dev
> > > *Cc: *Lars Francke
> > >
> > > +1
> > >>
> > >> *From: *Pablo Estrada 
> > >> *Date: *Tue, May 14, 2019 at 1:27 PM
> > >> *To: *dev
> > >> *Cc: *Lars Francke
> > >>
> > >> +1 on merging.
> > >>>
> > >>> *From: *Reuven Lax 
> > >>> *Date: *Tue, May 14, 2019 at 1:23 PM
> > >>> *To: *dev
> > >>> *Cc: *Lars Francke
> > >>>
> > >>>> I've been playing around with this the past day or two, and it's
> > >>>> great! I'm inclined to merge this PR (if nobody objects) so that
> others in
> > >>>> the community can contribute more training katas.
> > >>>>
> > >>>> Reuven
> > >>>>
> > >>>> *From: *Ismaël Mejía 
> > >>>> *Date: *Tue, Apr 23, 2019 at 6:43 AM
> > >>>> *To: *Lars Francke
> > >>>> *Cc: * 
> > >>>>
> > >>>> Thanks for answering Lars,
> > >>>>>
> > >>>>> The 'interesting' part is that the tutorial has a full IDE
> integrated
> > >>>>> experience based on the Jetbrains edu platform [1]. So maybe
> > >>>>> interesting to see if it could make sense to have projects like
> this
> > >>>>> in the new trainings incubator project or if they became too
> platform
> > >>>>> constrained.
> > >>>>>
> > >>>>> This contribution is valuable for Beam but the community may decide
> > >>>>> that it makes sense for it to live at some moment at the trainings
> > >>>>> project. I suppose also Henry could be interested in taking a look
> at
> > >>>>> this [2].
> > >>>>>
> > >>>>> [1] https://www.jetbrains.com/education/
> > >>>>> [2] https://incubator.apache.org/clutch/training.html
> > >>>>>
> > >>>>> On Tue, Apr 23, 2019 at 3:00 PM Lars Francke <
> lars.fran...@gmail.com>
> > >>>>> wrote:
> > >>>>> >
> > >>>>> > Thanks Ismaël.
> > >>>>> >
> > >>>>> > I must admit I'm a tad confused. What has JetBrains got to do
> with
> > >>>>> this?
> > >>>>> > This looks pretty cool and specific to Beam though, or is this
> more
> > >>>>> generic?
> > >>>>> > But yeah something along those lines could be interesting for
> > >>>>> hands-on type things in training.
> > >>>>> >
> > >>>>> > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía  >
> > >>>>> wrote:
> > >>>>> >>
> > >>>>> >> +lars.fran...@gmail.com who is in the Apache training project
> and
> > >>>>> may
> > >>>>> >> be interested in this one or at least the JetBrains like
> approach.
> > >>>>> >>
> > >>>>> >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía <
> ieme...@gmail.com>
> > >>>>> wrote:
> > >>>>> >> >
> > >>>>> >> > This looks great, nice for bringing this to the project Henry!
> > >>>>> >> >
> > >>>>> >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
> > >>>>> >> >  wrote:
> > >>>>> >> > >
> > >>>>> >> > > Thanks Altay.
> > >>>>> >> > > I'll create it under "learning/" first as this is not
> exactly
> > >>>>> example.
> > >>>>> >> > > Please do let me know if it's not the right place.
> > >>>>> >> > >
> > >>>>> >> > > On 2019/04/18 22:49:47, Ahmet Altay 
> wrote:
> > >>>>> >> > > > This looks great.
> > >>>>> >> > > >
> > >>>>> >> > > > +David Cavazos  was working on
> > >>>>> interactive colab based
> > >>>>> >> > > > examples (https://github.com/apache/beam/pull/7679)
> perhaps
> > >>>>> we can have a
> > >>>>> >> > > > shared place for these two similar things.
> > >>>>> >> > > >
> > >>>>> >> > >
> > >>>>>
> > >>>>
> > >
> >
>


Re: Postcommit kiosk dashboard

2019-05-17 Thread Pablo Estrada
The dashboard's here:
http://104.154.241.245/d/8N6LVeCmk/post-commits-status-dashboard?orgId=1

Or are you looking for something else in particular?
Best
-P.

On Fri, May 17, 2019 at 4:18 PM Kyle Weaver  wrote:

> Whatever happened to this dashboard? Having to manually maintain multiple
> lists of long links is a pain, and error-prone to boot.
>
> (Sorry for resurrecting a month-old thread)
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> | +1650203
>
>
> On Fri, Apr 19, 2019 at 2:53 AM Ismaël Mejía  wrote:
>
>> Catching up on this one, nice dashboard !
>> Some jobs are misisng e.g. validatesRunner for both Spark and Flink.
>> I suppose those are important if this may eventually replace the
>> README as Thomas suggests.
>>
>> On Fri, Mar 15, 2019 at 2:18 AM Thomas Weise  wrote:
>> >
>> > This is very nice!
>> >
>> > Perhaps it can also replace this manually maintained list?
>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md
>> >
>> >
>> > On Thu, Mar 14, 2019 at 1:01 PM Mikhail Gryzykhin 
>> wrote:
>> >>
>> >> Addressed comments:
>> >> 1. Added precommits.
>> >> 2. Limited the timeframe to 7 days. This removed old jobs from the table.
>> >> 2.1 We keep the history of all jobs in a separate DB that's used by Grafana.
>> Some of the deprecated jobs come from there.
>> >>
>> >> --Mikhail
>> >>
>> >> Have feedback?
>> >>
>> >>
>> >> On Thu, Mar 14, 2019 at 12:03 PM Michael Luckey 
>> wrote:
>> >>>
>> >>> Very nice!
>> >>>
>> >>> Two questions though:
>> >>> - the links on the left should point somewhere?
>> >>> - where are the beam_PostCommit_[Java|GO]_GradleBuild coming from?
>> Can't find them on Jenkins...
>> >>>
>> >>> On Thu, Mar 14, 2019 at 7:20 PM Mikhail Gryzykhin 
>> wrote:
>> >>>>
>> >>>> we already have https://s.apache.org/beam-community-metrics
>> >>>>
>> >>>> --Mikhail
>> >>>>
>> >>>> Have feedback?
>> >>>>
>> >>>>
>> >>>> On Thu, Mar 14, 2019 at 11:15 AM Pablo Estrada 
>> wrote:
>> >>>>>
>> >>>>> Woaahhh very fancy... this is great. Thanks so much. Love it. -
>> I also like the Code Velocity dashboard that you've added.
>> >>>>>
>> >>>>> Let's make these more discoverable. How about adding a shortlink?
>> s.apache.org/beam-dash ? : )
>> >>>>> Best
>> >>>>> -P.
>> >>>>>
>> >>>>> On Thu, Mar 14, 2019 at 10:58 AM Mikhail Gryzykhin <
>> mig...@google.com> wrote:
>> >>>>>>
>> >>>>>> Hi everyone,
>> >>>>>>
>> >>>>>> I've added a kiosk style post-commit status dashboard that can
>> help decorate your office space with green and red colors.
>> >>>>>>
>> >>>>>> Regards,
>> >>>>>> --Mikhail
>> >>>>>>
>> >>>>>> Have feedback?
>>
>


Re: Proposal: Add permanent url to community metrics dashboard

2019-05-17 Thread Pablo Estrada
We've talked about this offline. I love these dashboards, and use them
often. I'm in favor of doing this, but I'll let others chime in before
explicitly voting.

Another reason why I am in favor of this is that you've always been
responsive and willing to add improvements to it, so I can imagine them
becoming more useful in the future.

On Fri, May 17, 2019, 6:49 PM Mikhail Gryzykhin  wrote:

> Hello everyone,
>
> Some time ago we started community metrics dashboard.
> However, we never added
> a permanent URL for it. This is really inconvenient, since the only
> available way to access the dashboard is by IP address.
>
> In this tread I'd like to:
> 1. Vote to assign metrics.beam.apache.org to metrics dashboard (
> http://104.154.241.245).
> 2. Gather information on how to do it. I can assume only the following steps
> so far: a) vote, b) once the vote is complete, contact Apache INFRA to help
> with this.
>
> Regards,
> Mikhail.
>
>


Re: Jira component for HDFS issues with Python SDK

2019-05-23 Thread Pablo Estrada
I've added the io-python-hadoop component [1]. We can continue to discuss
naming conventions for components[2].

The way Java is set up, we have:
sdk-java-*
io-java-*

For Python, we have:
sdk-py-*
io-python-*

We could be consistent with the old Java IO components by doing `io-py-*`.

I think doing `sdk-py-io-*` could be helpful if we believe that people file
Python IO issues by looking at sdk-py-* components first, but it would be
changing the convention that we had with Java.

Thoughts?

[1]
https://jira.apache.org/jira/issues/?jql=project+%3D+BEAM+AND+component+%3D+io-python-hadoop
[2]
https://jira.apache.org/jira/projects/BEAM?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page

On Thu, May 23, 2019 at 9:35 AM Chamikara Jayalath 
wrote:

>
>
> On Thu, May 23, 2019 at 9:20 AM Valentyn Tymofieiev 
> wrote:
>
>> Hi,
>>
>> Could someone please help with addition io-python-hadoop or similar
>> component to Jira?
>>
>> Also, there is a small discrepancy in naming py vs python between:
>>
>> io-python-gcp and sdk-py-core - consider unifying them.
>>
>
> +1 on unifying. Probably better to have components sdk-py-core, sdk-py-io,
> sdk-py-io-gcp, sdk-py-io-hdfs, etc.
>
>
>>
>> Thank you!
>>
>


Re: Jira component for HDFS issues with Python SDK

2019-05-23 Thread Pablo Estrada
>
> My main point is that there should be a limited number of top level
> guessable prefixes. Looks like main prefixes today are runners-*, sdk-*,
> and io-*. As long as this list is small it should be fine (which is the case
> today). Also, we should probably document this structure somewhere. JIRA component
> selection is a long list of 55 items and only about 1/2 of that is visible
> initially.
>

I agree.

So it sounds like there are no objections (so far) for renaming
`io-python-*` components to `io-py-*`.

Is everyone okay with that?
Best
-P.


>
>
>>
>> [1]
>> https://jira.apache.org/jira/issues/?jql=project+%3D+BEAM+AND+component+%3D+io-python-hadoop
>> [2]
>> https://jira.apache.org/jira/projects/BEAM?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page
>>
>> On Thu, May 23, 2019 at 9:35 AM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Thu, May 23, 2019 at 9:20 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Hi,

 Could someone please help with addition io-python-hadoop or similar
 component to Jira?

 Also, there is a small discrepancy in naming py vs python between:

 io-python-gcp and sdk-py-core - consider unifying them.

>>>
>>> +1 on unifying. Probably better to have components sdk-py-core,
>>> sdk-py-io, sdk-py-io-gcp, sdk-py-io-hdfs, etc.
>>>
>>>

 Thank you!

>>>


PubSubIT tests topic cleanup?

2019-05-24 Thread Pablo Estrada
I've found a bunch of topics created by PubSub integration tests - they
don't seem to be getting cleaned up, perhaps?

614 name:
projects/apache-beam-testing/topics/integ-test-PubsubJsonIT-testSelectsPayloadContent-
614 name:
projects/apache-beam-testing/topics/integ-test-PubsubJsonIT-testSQLLimit-
222 name:
projects/apache-beam-testing/topics/integ-test-PubsubReadIT-testReadPublicData-


Re: PubSubIT tests topic cleanup?

2019-05-24 Thread Pablo Estrada
Seems like Mikhail created https://issues.apache.org/jira/browse/BEAM-6610 last
time ^^'

On Fri, May 24, 2019 at 11:58 AM Kenneth Knowles  wrote:

> Is there a jira tracking this?
>
> Kenn
>
> On Fri, May 24, 2019, 11:50 Andrew Pilloud  wrote:
>
>> This came up on the list in before in February:
>>
>> https://lists.apache.org/thread.html/38384d193e6f0af89f00a583e56cff93b18cfaebbf84e743eb900bc5@%3Cdev.beam.apache.org%3E
>>
>> We should be cleaning up topics, but it sounds like we aren't.
>>
>> Andrew
>>
>> On Fri, May 24, 2019 at 11:42 AM Pablo Estrada 
>> wrote:
>>
>>> I've found a bunch of topics created by PubSub integration tests - they
>>> don't seem to be getting cleaned up, perhaps?
>>>
>>> 614 name:
>>> projects/apache-beam-testing/topics/integ-test-PubsubJsonIT-testSelectsPayloadContent-
>>> 614 name:
>>> projects/apache-beam-testing/topics/integ-test-PubsubJsonIT-testSQLLimit-
>>> 222 name:
>>> projects/apache-beam-testing/topics/integ-test-PubsubReadIT-testReadPublicData-
>>>
>>


Re: [DISCUSS] Autoformat python code with Black

2019-05-27 Thread Pablo Estrada
This looks pretty good :) I know at least a couple of people (myself included)
who've been annoyed by having to take care of lint issues that a code
formatter could maybe save us from.
Thanks for sharing Ismael.
-P.


On Mon, May 27, 2019, 12:24 PM Ismaël Mejía  wrote:

> I stumbled by chance into Black [1], a Python code auto-formatter that
> is becoming the 'de facto' auto-formatter for Python, and wanted to
> bring it to the ML. Is there interest from the Python people to get this
> into the build?
>
> The introduction of Spotless for Java has been a good improvement, and
> maybe the Python code base may benefit from this too.
>
> WDYT?
>
> [1] https://github.com/python/black
>


Re: Shuffling on apache beam

2019-05-29 Thread Pablo Estrada
If you add a stateful DoFn to your pipeline, you'll force Beam to shuffle
data to its corresponding worker per key. I am not sure what the latency
cost of doing this is (as the messages still need to be shuffled), but it
may help you accomplish this without adding windowing+triggering.
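
To illustrate the idea, here is a minimal sketch of a stateful DoFn in the
Python SDK (not from this thread; the per-key counter state is just a
placeholder):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec


class KeyedCountDoFn(beam.DoFn):
  # Declaring a state spec makes the DoFn stateful, which forces the
  # runner to route all elements with the same key to the same worker.
  COUNT = CombiningValueStateSpec('count', VarIntCoder(), sum)

  def process(self, element, count=beam.DoFn.StateParam(COUNT)):
    key, value = element  # the input PCollection must be keyed
    count.add(1)
    yield key, value, count.read()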

-P.

On Wed, May 29, 2019 at 5:16 AM pasquale.bon...@gmail.com <
pasquale.bon...@gmail.com> wrote:

> Hi Reza,
> with GlobalWindow with triggering I was able to reduce the hotspot issues,
> gaining satisfying performance for the BigTable update. Unfortunately, the
> latency when getting messages from PubSub remains around 1.5s, which is too
> much considering our NFR.
>
> This is the code I use to get the messages:
> PCollectionTuple rawTransactions = p
>     .apply("GetMessages",
>         PubsubIO.readMessagesWithAttributes()
>             .withIdAttribute(TRANSACTION_MESSAGE_ID_FIELD_NAME)
>             .withTimestampAttribute(TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME)
>             .fromTopic(topic))
>     .apply(Window.configure()
>         .triggering(Repeatedly.forever(
>             AfterWatermark.pastEndOfWindow()
>                 .withEarlyFirings(AfterProcessingTime
>                     .pastFirstElementInPane()
>                     .plusDelayOf(Duration.millis(1)))
>                 // Fire on any late data
>                 .withLateFirings(AfterPane.elementCountAtLeast(1))))
>         .discardingFiredPanes());
>
> Messages are produced with a different dataflow:
> Pipeline p = Pipeline.create(options);
> p.apply(
>         "ReadFile",
>         TextIO.read()
>             .from(options.getInputLocation() + "/*.csv")
>             .watchForNewFiles(
>                 // Check for new files every 600 ms
>                 Duration.millis(600),
>                 // Never stop checking for new files
>                 Watch.Growth.never()))
>     .apply(
>         "create message",
>         ParDo.of(
>             new DoFn<String, PubsubMessage>() {
>               @ProcessElement
>               public void processElement(ProcessContext context) {
>                 String line = context.element();
>                 String payload = convertRow(line);
>                 long now = LocalDateTime.now()
>                     .atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
>                 context.output(
>                     new PubsubMessage(
>                         payload.getBytes(),
>                         ImmutableMap.of(
>                             TRANSACTION_MESSAGE_ID_FIELD_NAME, payload.split(",")[6],
>                             TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME,
>                             Long.toString(now))));
>               }
>             }))
>     .apply("publish message", PubsubIO.writeMessages().to(topic));
>
> I'm uploading a file containing 100 rows every 600 ms.
>
> I found several threads on StackOverflow around this latency issue, but
> none has a solution.
>
>
>
>
> On 2019/05/24 07:19:02, Reza Rokni  wrote:
> > PS You can also make use of the GlobalWindow with a stateful DoFn.
> >
> > On Fri, 24 May 2019 at 15:13, Reza Rokni  wrote:
> >
> > > Hi,
> > >
> > > Have you explored the use of triggers with your use case?
> > >
> > >
> > >
> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html
> > >
> > > Cheers
> > >
> > > Reza
> > >
> > > On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com <
> > > pasquale.bon...@gmail.com> wrote:
> > >
> > >> Hi Reuven,
> > >> I would like to know if it is possible to guarantee that records are
> > >> processed by the same thread/task based on a key, as probably happens
> > >> in a combine/stateful operation, without adding the delay of a window.
> > >> This could increase the efficiency of caching and reduce some racing
> > >> conditions when writing data.
> > >> I understand that workers are not part of the programming model, so I
> > >> would like to know if it's possible to achieve this behaviour while
> > >> reducing the delay of windowing to a minimum. We don't need any combine
> > >> or state; we just want all records with a given key to be sent to the
> > >> same thread.
> > >>
> > >> Thanks
> > >>
> > >>
> > >> On 2019/05/24 03:20:13, Reuven Lax  wrote:
> > >> > Can you explain what you mean by worker? While every runner has
> workers
> > >> of
> > >> > course, workers are not part of the programming model.
> > >> >
> > >> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com <
> > >> > pasquale.bon...@gmail.com> wrote:
> > >> >
> > >> > > Hi all,
> > >> > > I would like to know if Apache Beam has functionality similar to
> > >> > > fieldsGrouping in Storm that allows sending records to a specific
> > >> > > task/worker based on a key.
> > >> > > I know that we can achieve that with a combine/groupByKey operation,
> > >> > > but that implies adding windowing to our pipeline, which we don't
> > >> > > want.
> > >> > > I

Re: Support for PaneInfo in Python SDK

2019-05-30 Thread Pablo Estrada
Hi Tanay,
thanks for bringing this to the mailing list. I believe this is certainly
useful, and necessary. As an example, the fileio.WriteToFiles transform
does not work well without PaneInfo data (since we can't know how many
firings there are for each window, and we can't give names to files based
on this).
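
To make the gap concrete: today a DoFn can request the timestamp or the
window through DoFn params, and the missing piece discussed here is an
analogous PaneInfoParam - hypothetical at the time of this thread:

import apache_beam as beam


class DescribeFiring(beam.DoFn):
  def process(
      self,
      element,
      timestamp=beam.DoFn.TimestampParam,  # supported today
      window=beam.DoFn.WindowParam,        # supported today
      pane_info=beam.DoFn.PaneInfoParam):  # the missing analogue (hypothetical)
    # With pane info available, a sink could tell early/on-time/late
    # firings apart, or number the files it writes per firing.
    yield element, timestamp, pane_info.index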

Best
-P.

On Thu, May 30, 2019 at 1:00 PM Tanay Tummalapalli 
wrote:

> Hi everyone,
>
> The PR linked in [BEAM-3759] - "Add support for PaneInfo descriptor in
> Python SDK"[1] was merged, but, the issue is still open.
> There might be some work left on this for full support for PaneInfo. Eg:
> Although the PaneInfo class exists, it is not accessible in a DoFn via a
> kwarg(PaneInfoParam) like TimestampParam or WindowParam.
>
> Please let me know the remaining work to be done on this issue as this may
> be needed in the near future.
>
> Regards
> Tanay Tummalapalli
>
> [1] https://issues.apache.org/jira/browse/BEAM-3759
>


Re: Jira tracker permission

2019-06-03 Thread Pablo Estrada
I've added you as a contributor - welcome!
-P.

On Mon, Jun 3, 2019, 9:16 PM Yichi Zhang  wrote:

> Hi, beam-dev,
>
> This is Yichi Zhang from Google, I just started looking into beam projects
> and will be actively working on beam sdk, could someone grant me permission
> to beam jira issue tracker? My jira username is yichi
> .
>
> Looking forward to working with everyone else.
>
> Thanks,
> Yichi
>


Re: [discuss] A tweak to the Python API for SDF?

2019-06-05 Thread Pablo Estrada
I have no objections.

+Ismaël Mejía  who has familiarity and interest in Java
SDF.

On Wed, Jun 5, 2019 at 11:31 AM Brian Hulette  wrote:

> Just wanted to resurrect this to say that it seems appropriate to make the
> same change in Java. All the same arguments apply there, and now there's
> the additional argument for maintaining symmetry with Python.
>
> I think BEAM-7250 should be changed to a ticket to actually implement this
> in Java unless someone has an objection.
>
> Brian
>
> On Wed, May 8, 2019 at 2:20 PM Pablo Estrada  wrote:
>
>> Hello all,
>> The API has been updated for Python (See
>> https://github.com/apache/beam/pull/8430). Please, if you catch any
>> documentation that needs updating, flag to me or just propose the change : )
>>
>> As for Java - we didn't end up determining whether it makes sense to
>> update the API as well. Thoughts from others?
>>
>> In any case, I've filed https://jira.apache.org/jira/browse/BEAM-7250 to
>> track this for Java.
>>
>> Best
>> -P.
>>
>> On Mon, Apr 29, 2019 at 2:41 PM Lukasz Cwik  wrote:
>>
>>> Pablo, all the SplittableDoFn stuff is marked as @Experimental so one is
>>> able to change it. There really is only one complicated one to change in
>>> Watch.java, the rest are quite straightforward.
>>>
>>> On Mon, Apr 29, 2019 at 2:23 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Thanks all,
>>>>  @Luke - I imagine that would be an improvement to the API, but this
>>>> may be harder as this is already available to users, and there are those
>>>> who have implemented SDFs under the current API. Would it be possible to
>>>> make a backwards-compatible change to the API here?
>>>>
>>>> For the Python changes, I've proposed a pull request:
>>>> https://github.com/apache/beam/pull/8430 - it was smaller than I
>>>> thought : ) - All comments welcome please.
>>>>
>>>> +Boyuan Zhang  I am happy to wait for your
>>>> SyntheticSource PR to be merged and make the appropriate changes if you'd
>>>> like.
>>>> Best
>>>> -P.
>>>>
>>>> On Mon, Apr 29, 2019 at 8:23 AM Lukasz Cwik  wrote:
>>>>
>>>>> Would it make sense to also do this in the Java SDK?
>>>>>
>>>>> The would make the restriction provider also mirror the TimerSpec and
>>>>> StateSpec which use annotations similar to how its done in Python.
>>>>>
>>>>> On Mon, Apr 29, 2019 at 3:42 AM Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> +1 to introducing this Param for consistency (and making the
>>>>>> substitution more obvious), and I think SDF is still new/experimental
>>>>>> enough we can do this. I don't know if we need Spec in addition to
>>>>>> Param and Provider.
>>>>>>
>>>>>> On Sat, Apr 27, 2019 at 1:07 AM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Apr 26, 2019 at 3:43 PM Pablo Estrada 
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Hi all,
>>>>>> >> Sorry about the wall of text.
>>>>>> >> So, first of all, I thought about this while reviewing a PR by
>>>>>> Boyuan with an example of an SDF[1]. This is very exciting btw : ).
>>>>>> >>
>>>>>> >> Anyway... I certainly have a limited view of the whole SDF effort,
>>>>>> but I think it's worth discussing this particular point about the API
>>>>>> before finalizing SDF and making it widely available. So here I go:
>>>>>> >>
>>>>>> >> The Python API for SDF asks users to provide a restriction
>>>>>> provider in their process function signature. More or less the following:
>>>>>> >>
>>>>>> >> class MyOwnLittleSDF(beam.DoFn):
>>>>>> >>   def process(self, element,
>>>>>> >>       restriction_tracker=MyOwnLittleRestrictionProvider()):
>>>>>> >>     # My DoFn logic...
>>>>>> >>
>>>>>> >> This is all fine, but something that I found a little odd is that
>>>>>> the restriction provider ge

Re: Design Proposal for Cost Estimation

2019-06-07 Thread Pablo Estrada
I've added you as a contributor! : )

On Fri, Jun 7, 2019 at 11:20 AM Alireza Samadian 
wrote:

> Hi,
>
> I am going to create Issues in Jira and start implementing row estimation
> of each source separately. I will appreciate if someone gives me the
> permission to assign Jira Issues to myself. My Jira id is riazela.
>
> Best,
> Alireza
>
> On Fri, May 31, 2019 at 3:54 PM Alireza Samadian 
> wrote:
>
>> Dear Members of Apache Beam Dev List,
>>
>> My name is Alireza; I am a Software Engineer Intern at Google, and I am
>> working closely with Anton on Beam SQL query optimizer. Currently, it uses
>> Apache Calcite without any cost estimation; I am proposing to implement the
>> cost estimator for it.
>> The first step would be implementing cost estimator for the sources; this
>> is my design proposal for this implementation. I will appreciate your
>> comments and suggestions.
>>
>>
>> https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit#heading=h.6rlkpwwx7gvf
>>
>> Best,
>> Alireza Samadian
>>
>


[ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Pablo Estrada
Hello everyone,
A group of community members has been working on gathering and providing
common pipeline patterns for pipelines in Beam. These are examples of how
to perform certain operations, and useful ways of using Beam in your
pipelines. Some of them relate to processing of files, use of side inputs,
state/timers, etc. Check them out[1].

These initial patterns have been chosen based on evidence gathered from
StackOverflow, and from talking to users of Beam.

It would be great if this section could grow, and be useful to many Beam
users. For that reason, we invite anyone to share patterns, and pipeline
examples that they have used in the past. If you are interested in
contributing, please submit a pull request, or get in touch with Cyrus
Maden, Reza Rokni, Melissa Pashniak or myself.

Thanks!
Best
-P.

[1] https://beam.apache.org/documentation/patterns/overview/


Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Pablo Estrada
I am able to create the tags if we decide to go that way :)

I have seen users relying on the 'python' tag, along with the 'apache-beam'
tag to specify an SDK. Maybe that's enough? What advantage would come from
a dedicated beam tag?

Perhaps users will more easily become accustomed to using the apache-beam-(SDK)
tags, and in that case, it makes sense to add them.

Thoughts?
-P.




On Tue, Jul 9, 2019, 8:52 AM Lukasz Cwik  wrote:

> That sounds like a good idea to me.
>
> On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:
>
>> Hi Community,
>>
>> When reading apache-beam related questions on StackOverflow, it happens
>> that some questions only mention a version number (e.g. 2.8.0) but do not
>> mention which SDK they relate to. Sometimes I can tell which SDK it is from
>> code snippets; sometimes I cannot, as there is no code snippet. So in order
>> to answer those questions I need to first comment and ask which SDK.
>>
>> I noticed that there is no tag for a specific SDK for apache beam. Adding
>> such tags will be helpful when
>> 1. A question with such a tag tells which SDK it is talking about.
>> 2. If a question does not mention the SDK and lacks such a tag, I can (or
>> anyone else) help tag it.
>>
>> Note that creating tags is a privilege in SO that requires >1500
>> reputation[1]. If people generally are ok with this idea, we will need to
>> ask for help in the community to see who could be able to create tags.
>>
>>
>> [1]: https://stackoverflow.com/help/privileges/create-tags
>>
>> Rui
>>
>


Re: [VOTE] Vendored dependencies release process

2019-07-09 Thread Pablo Estrada
Thanks Luke for writing a thorough guide for this.
+1

On Tue, Jul 9, 2019 at 1:27 PM Maximilian Michels  wrote:

> +1
>
> On 09.07.19 22:16, Udi Meiri wrote:
> > +1 LGTM
> >
> > On Mon, Jul 8, 2019 at 4:54 PM Lukasz Cwik  > > wrote:
> >
> > Thanks for taking a look. I followed up on your questions.
> >
> > On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri  > > wrote:
> >
> > I left some comments. Being new to the Beam releasing process,
> > my question might be trivial to someone actually performing the
> > release.
> >
> > On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik  > > wrote:
> >
> > Please vote based on the vendored dependencies release
> > process as discussed[1] and documented[2].
> >
> > Please vote as follows:
> > +1: Adopt the vendored dependency release process
> > -1: The vendored release process needs to change because ...
> >
> > Since many people in the US may be out due to the holiday
> > schedule, I'll try to close the vote and tally the results
> > on July 9th so please vote before then.
> >
> > 1:
> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
> > 2: https://s.apache.org/beam-release-vendored-artifacts
> >
>
>


Re: Jira Contributors List

2019-07-12 Thread Pablo Estrada
It seems that both have been added. Welcome!

On Fri, Jul 12, 2019 at 10:12 AM Rui Wang  wrote:

> Hi Francesco,
>
> What's your JIRA ID?
>
>
> -Rui
>
> On Thu, Jul 11, 2019 at 9:17 AM Francesco Perera 
> wrote:
>
>> Hi,
>> I am new to the beam community but  I am eager to contribute back. I am
>> going to work on this issue :
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail&selectedIssue=BEAM-7198
>> but can someone here add me to the contributors list in Jira ?
>>
>> Thanks,
>> Francesco
>> --
>> Francesco Perera Kuranapatabendige
>> fperer...@gmail.com | 646-719-6970 | www.linkedin.com/in/fperera
>>
>>
>>
>>
>


Re: Bucketed histogram metrics in beam. Anyone currently looking into this?

2019-07-12 Thread Pablo Estrada
I am not aware of anyone working on this. I do recall a couple of things (a
toy sketch of the bucketed idea follows the list):

- These metrics can be very large in terms of space. Users may cause
themselves trouble if they define too many of them.
- Not enough reason not to do it, but certainly worth considering.
- There is some code added by Boyuan to develop highly efficient
histogram-type metrics.
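
For context: a Distribution only keeps min/max/sum/count, while a bucketed
histogram keeps a count per value range. A toy sketch of the idea (this is
not a Beam API, just an illustration):

import bisect


class FixedBucketHistogram:
  def __init__(self, boundaries):
    # e.g. boundaries = [1, 10, 100, 1000]; the last bucket is overflow.
    self._boundaries = sorted(boundaries)
    self._counts = [0] * (len(self._boundaries) + 1)

  def record(self, value):
    # bisect_left picks the first bucket whose upper boundary is >= value.
    self._counts[bisect.bisect_left(self._boundaries, value)] += 1

  def buckets(self):
    labels = ['<=%s' % b for b in self._boundaries] + ['overflow']
    return list(zip(labels, self._counts))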

Best
-P.

On Fri, Jul 12, 2019 at 10:21 AM Alex Amato  wrote:

> Hi,
>
> I was wondering if anyone has any plans to introduce bucketed histograms to
> Beam (different from Distribution, which is just min, max, sum and count
> values)? I have some thoughts about how it could be done so that it
> integrates with stackdriver.
>
> Essentially I am referring to a timeseries of histograms, displaying
> buckets of values at fixed windows in time.
>


Re: PostCommit tests broken

2019-07-15 Thread Pablo Estrada
Hello all,
I've discussed this with Valentyn, but I'm looking at the BigQuery failures.
Best
-P.

On Mon, Jul 15, 2019 at 2:40 PM Valentyn Tymofieiev 
wrote:

> Thanks Kasia.
>
> Below are the most frequent flakes I have encountered in postcommit
> failures, but there may be more. Let's track resolution in respective
> Jira's and avoid re-creating duplicates.
>
> https://issues.apache.org/jira/browse/BEAM-7527 - ModuleNotFoundError (No
> module named 'beam_runner_api_pb2') in Python 3 test suites that run in
> parallel.
> https://issues.apache.org/jira/browse/BEAM-5874 - BigQueryQueryToTableIT
> is flaky on direct runner, incomplete assertion: expected checksum is
> e1fbcb5ca479a5ca5f9ecf444d6998beee4d44c6)  but: [some console output]
> https://issues.apache.org/jira/browse/BEAM-7749 - Occasional flakes due
> to insufficient quota: not enough quota in apache-beam-testing.
> https://issues.apache.org/jira/browse/BEAM-7463 - BigQueryQueryToTableIT
> is flaky on direct runner (incorrect checksum)
> https://issues.apache.org/jira/browse/BEAM-6202 - Dataflow runner
> occasionally fails with "Job did not reach to a terminal state after
> waiting indefinitely", while job succeeds.
> https://issues.apache.org/jira/browse/BEAM-7546 - Portable
> WordCount-on-Flink Precommit is flaky - temporary folder not found. Fix
> submitted, awaiting verification.
>
> I'll try to find owners for issues that currently don't have an owner.
>
> Thanks,
> Valentyn
>
>
> On Mon, Jul 15, 2019 at 9:17 AM Katarzyna Kucharczyk <
> ka.kucharc...@gmail.com> wrote:
>
>> Hi,
>>
>> this is something we also observed in Python LoadTests. I created another
>> issue on Friday, [BEAM-7733], because I didn't see
>> the one you shared, Valentyn. The whole problem seems a bit similar to
>> this one ([BEAM-5529]), but I am
>> not sure if the pipeline in the resolved issue had the same HTTP error.
>>
>> Thanks,
>> Kasia
>>
>> On Mon, Jul 15, 2019 at 7:11 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> This is actually a quite frequent error right now.
>>> https://issues.apache.org/jira/browse/BEAM-6202 is opened to make
>>> Dataflow runner more robust when polling status of a running Dataflow job.
>>>
>>> On Wed, Mar 6, 2019 at 8:18 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Hi Juta,

 I also looked into the precommit errors and seems like the errors are
 associated with a disruption around that time unrelated to Beam or Beam
 test infrastructure. Please try again.

 Thanks,
 Valentyn

 On Wed, Mar 6, 2019 at 4:18 PM Mark Liu  wrote:

> I saw HttpError from the console log
>  in the
> job you provided above
> :
>
> "error": {
> "code": 404,
> "message": "(b8b815bd6683759f): Information about job
> 2019-03-06_04_32_19-1780414729477627626 could not be found in our system.
> Please double check the id is correct. If it is please contact customer
> support.",
> "status": "NOT_FOUND"
> }
>
> Looks like it's caused by Dataflow service flakiness, not your test. Since
> the current postcommit
>  is
> green now, you can run postcommit again.
>
> Mark
>
>
> On Wed, Mar 6, 2019 at 10:04 AM Juta Staes  wrote:
>
>> Hi all,
>>
>> I am working on adding Python 3 integration tests to the post commit
>> tests.
>>
>> Since yesterday none of the post commit tests seem to pass. There are
>> errors in the build jobs as can be seen in the Post-Commit Tests Status
>> table at https://github.com/apache/beam/pull/7946 (and in other prs
>> as well) and at
>> https://builds.apache.org/job/beam_PostCommit_Python3_Verify_PR/77/console
>>
>> AssertionError: Job did not reach to a terminal state after waiting 
>> indefinitely.
>>
>>
>> Yesterday my tests did pass:
>> https://builds.apache.org/job/beam_PostCommit_Python3_Verify_PR/61/console
>>  but
>> failed on a build timed out.
>>
>> Does someone has any idea what is happening with the post commit
>> tests?
>>
>> Thanks!
>>
>> --
>>
>>
>> * Juta Staes*
>> ML6 Gent
>> 
>>

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Pablo Estrada
@Thomas do you think this is a problem of documentation, or a missing
feature?

We did not add support for it without Cython because the cost of locking
and checking every 200ms in Python would be too high - that's why this is
only implemented in the optimized Cython codepath. I think it makes sense
to document this, rather than adding the support, as it would be really
expensive. What are your thoughts?

Best
-P.

On Mon, Jul 15, 2019, 1:48 PM Thomas Weise  wrote:

> That's great, but I think the JIRA needs to remain open since w/o Cython
> the metric still doesn't work.
>
> It would however be helpful to add a comment regarding your findings.
>
>
> On Mon, Jul 15, 2019 at 1:46 PM Rakesh Kumar  wrote:
>
>>
>> Installing cython in the application environment fixed the issue. Now I
>> am able to see the operator metrics ({organization_specific_prefix}
>> .operator.beam-metric-pardo_execution_time-process_bundle_
>> msecs-v1.gauge.mean)
>>
>> Thanks Ankur for looking into it and providing support.
>>
>> I am going to close  https://issues.apache.org/jira/browse/BEAM-7058 if
>> no one has any objection?
>>
>>
>> On Thu, Apr 11, 2019 at 7:13 AM Thomas Weise  wrote:
>>
>>> Tracked as https://issues.apache.org/jira/browse/BEAM-7058
>>>
>>>
>>> On Wed, Apr 10, 2019 at 11:38 AM Pablo Estrada 
>>> wrote:
>>>
>>>> This sounds like a bug then? +Alex Amato 
>>>>
>>>> On Wed, Apr 10, 2019 at 3:59 AM Maximilian Michels 
>>>> wrote:
>>>>
>>>>> Hi @all,
>>>>>
>>>>>  From a quick debugging session, I conclude that the wiring is in
>>>>> place
>>>>> for the Flink Runner. There is a ProgressReporter that reports
>>>>> MonitoringInfos to Flink, in a similar fashion as the "legacy" Runner.
>>>>>
>>>>> The bundle duration metrics are 0, but the element count gets reported
>>>>> correctly. It appears to be an issue of the Python/Java harness
>>>>> because
>>>>> "ProcessBundleProgressResponse" contains only 0 values for the bundle
>>>>> duration.
>>>>>
>>>>> Thanks,
>>>>> Max
>>>>>
>>>>> On 04.04.19 19:54, Mikhail Gryzykhin wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > Quick summary on python and Dataflow Runner:
>>>>> > Python SDK already reports:
>>>>> > - MSec
>>>>> > - User metrics (int64 and distribution)
>>>>> > - PCollection Element Count
>>>>> > - Work on MeanByteCount for pcollection is ongoing here
>>>>> > <https://github.com/apache/beam/pull/8062>.
>>>>> >
>>>>> > Dataflow Runner:
>>>>> > - all metrics listed above are passed through to Dataflow.
>>>>> >
>>>>> > Ryan can give more information on Flink Runner. I also see
>>>>> Maximilian on
>>>>> > some of relevant PRs, so he might comment on this as well.
>>>>> >
>>>>> > Regards,
>>>>> > Mikhail.
>>>>> >
>>>>> >
>>>>> > On Thu, Apr 4, 2019 at 10:43 AM Pablo Estrada >>>> > <mailto:pabl...@google.com>> wrote:
>>>>> >
>>>>> > Hello guys!
>>>>> > Alex, Mikhail and Ryan are working on support for metrics in the
>>>>> > portability framework. The support on the SDK is pretty advanced
>>>>> > AFAIK*, and the next step is to get the metrics back into the
>>>>> > runner. Lukazs and myself are working on a project that depends
>>>>> on
>>>>> > this too, so I'm adding everyone so we can get an idea of what's
>>>>> > missing.
>>>>> >
>>>>> > I believe:
>>>>> > - User metrics are fully wired up in the SDK
>>>>> > - State sampler (timing) metrics are wired up as well (is that
>>>>> > right, +Alex Amato <mailto:ajam...@google.com>?)
>>>>> > - Work is ongoing to send the updates back to Flink.
>>>>> > - What is the plan for making metrics queriable from Flink? +Ryan
>>>>> > Williams <mailto:r...@runsascoded.com>
>>>>> >
>>>>> > Thanks!
>>

Re: [VOTE] Vendored Dependencies Release

2019-07-15 Thread Pablo Estrada
+1
verified hashes and signatures

On Fri, Jul 12, 2019 at 9:40 AM Kai Jiang  wrote:

> +1 (non-binding)
>
> On Thu, Jul 11, 2019 at 8:27 PM Lukasz Cwik  wrote:
>
>> Please review the release of the following artifacts that we vendor:
>>  * beam-vendor-grpc_1_21_0
>>  * beam-vendor-guava-26_0-jre
>>  * beam-vendor-bytebuddy-1_9_3
>>
>> Hi everyone,
>> Please review and vote on the release candidate #3 for the
>> org.apache.beam:beam-vendor-grpc_1_21_0:0.1,
>> org.apache.beam:beam-vendor-guava-26_0-jre:0.1, and
>> org.apache.beam:beam-vendor-bytebuddy-1_9_3:0.1 as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which includes:
>> * the official Apache source release to be deployed to dist.apache.org
>> 
>> [1], which is signed with the key with fingerprint
>> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
>> * all artifacts to be deployed to the Maven Central Repository [3],
>> * commit hash "0fce2b88660f52dae638697e1472aa108c982ae6" [4],
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Luke
>>
>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>> 
>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>> 
>> [3]
>> https://repository.apache.org/content/repositories/orgapachebeam-1078/
>> 
>> [4]
>> https://github.com/apache/beam/commit/0fce2b88660f52dae638697e1472aa108c982ae6
>> 
>>
>


Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Pablo Estrada
The state sampler is the only state provider for the Python SDK. This means
that the Metrics module relies on it to attribute metrics to each step; and
the logging module also uses it to attribute logs to each step.
statesampler_slow does not implement the actual sampling, but it does
implement the state-provider functionality that other modules rely on.

As for reducing the sampling frequency - in fact, the problem is that the
lock needs to be acquired on every change of state[1], which is a very hot
path for the Python SDK. In statesampler_slow we do this by appending the
state to a stack defined via a Python list (a very efficient operation in
Python)[2].

"But Pablo, how about the locks that the status-reporting thread needs to
use to check counters while the main thread executes a work item? Surely
this lock is necessary in Python and Cython!"
- The answer here is that we do not do any locking for the Python-only
implementation, because the status-reporting thread only reads data, and we
can handle a few inconsistencies (since final numbers are committed upon
work item completion).
- The sampling thread, on the other hand, writes data into the counters, so
it's necessary[3] to use locking.

I do agree that we should be better at letting the user know that their
msec metrics are not being collected.

[1]
https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_fast.pyx#L215-L220
[2]
https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L53-L55
[3]
https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_fast.pyx#L132-L133
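
To illustrate the write-locked / read-unlocked split described above, here
is a toy model (not Beam's actual code; the state names are made up):

import threading
import time

STATES = ('start', 'process', 'finish')


class ToySampler:
  def __init__(self):
    self._lock = threading.Lock()
    # Fixed keys, so the dict never changes size while being copied.
    self._msecs_by_state = {s: 0 for s in STATES}

  def sample_forever(self, current_state_fn, period_ms=200):
    # Writer side: the sampling thread mutates the counters under the
    # lock (in the real code the main thread also takes it on every
    # state transition, which is the hot path mentioned above).
    while True:
      time.sleep(period_ms / 1000.0)
      with self._lock:
        self._msecs_by_state[current_state_fn()] += period_ms

  def report_progress(self):
    # Reader side: progress reporting copies without taking the lock;
    # slightly stale values are acceptable, since final numbers are
    # committed when the work item completes.
    return dict(self._msecs_by_state)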

On Mon, Jul 15, 2019 at 5:04 PM Alex Amato  wrote:

> Perhaps no metric at all should be returned, instead of 0, which is an
> incorrect value.
>
> Also, is there a reason to have state_sampler_slow at all then, if it's not
> intended to be implemented?
>
> On Mon, Jul 15, 2019 at 5:03 PM Kyle Weaver  wrote:
>
>> Pablo, what about setting a lower sampling rate? Or would that lead to
>> poor results?
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>> | +1650203
>>
>>
>> On Mon, Jul 15, 2019 at 4:44 PM Pablo Estrada  wrote:
>>
>>> @Thomas do you think this is a problem of documentation, or a missing
>>> feature?
>>>
>>> We did not add support for it without Cython because the cost of locking
>>> and checking every 200ms in Python would be too high - that's why this is
>>> only implemented in the optimized Cython codepath. I think it makes sense
>>> to document this, rather than adding the support, as it would be really
>>> expensive. What are your thoughts?
>>>
>>> Best
>>> -P.
>>>
>>> On Mon, Jul 15, 2019, 1:48 PM Thomas Weise  wrote:
>>>
>>>> That's great, but I think the JIRA needs to remain open since w/o
>>>> Cython the metric still doesn't work.
>>>>
>>>> It would however be helpful to add a comment regarding your findings.
>>>>
>>>>
>>>> On Mon, Jul 15, 2019 at 1:46 PM Rakesh Kumar 
>>>> wrote:
>>>>
>>>>>
>>>>> Installing cython in the application environment fixed the issue. Now
>>>>> I am able to see the operator metrics ({organization_specific_prefix}
>>>>> .operator.beam-metric-pardo_execution_time-process_bundle_
>>>>> msecs-v1.gauge.mean)
>>>>>
>>>>> Thanks Ankur for looking into it and providing support.
>>>>>
>>>>> I am going to close  https://issues.apache.org/jira/browse/BEAM-7058 if
>>>>> no one has any objection?
>>>>>
>>>>>
>>>>> On Thu, Apr 11, 2019 at 7:13 AM Thomas Weise  wrote:
>>>>>
>>>>>> Tracked as https://issues.apache.org/jira/browse/BEAM-7058
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2019 at 11:38 AM Pablo Estrada 
>>>>>> wrote:
>>>>>>
>>>>>>> This sounds like a bug then? +Alex Amato 
>>>>>>>
>>>>>>> On Wed, Apr 10, 2019 at 3:59 AM Maximilian Michels 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi @all,
>>>>>>>>
>>>>>>>>  From a quick debugging session, I conclude that the wiring is in
>>>>>>>> place
>>>>>>>> for the Flink Runner

Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-16 Thread Pablo Estrada
Hello all,
I'll be having a session where I live-fix a Beam bug for 1 hour next week.
Everyone is invited.

It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I will
finish a full change in that time frame, but we'll see.

I have not yet decided if I will do this via hangouts, or via a youtube
livestream. In any case, I will share the link here in the next few days.

I will most likely work on the Java SDK (I have a little feature request in
mind).

Thanks!
-P.


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-17 Thread Pablo Estrada
Yes! So I will be working on a small feature request for Java's BigQueryIO:
https://issues.apache.org/jira/browse/BEAM-7607

Maybe I'll do something for Python next month. : )
Best
-P.

On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar  wrote:

> +1, I really appreciate this initiative. It would be really helpful
> for newbies like me.
>
> Is it possible to list out what are the things that you are planning to
> cover?
>
>
>
>
> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang  wrote:
>
>> Thanks for organizing this Pablo, it'll be very helpful!
>>
>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>> wrote:
>>
>>> Hello all,
>>> I'll be having a session where I live-fix a Beam bug for 1 hour next
>>> week. Everyone is invited.
>>>
>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I will
>>> finish a full change in that time frame, but we'll see.
>>>
>>> I have not yet decided if I will do this via hangouts, or via a youtube
>>> livestream. In any case, I will share the link here in the next few days.
>>>
>>> I will most likely work on the Java SDK (I have a little feature request
>>> in mind).
>>>
>>> Thanks!
>>> -P.
>>>
>>


Re: Proposal: Add permanent url to community metrics dashboard

2019-07-17 Thread Pablo Estrada
I'd like to move this forward. Mikhail, would you be interested in filing
an issue with Infra to see if it's possible? I can do it if you prefer.

It seems that the concerns related to these dashboards showing up in search
results have been addressed. Does the community have any other concern
around this before we can move it forward?
Best
-P.

On Wed, May 22, 2019 at 8:53 AM Kenneth Knowles  wrote:

> I suggest asking infra about the best way to proceed, so that we don't
> vote on something that doesn't work for them. This might be something handy
> to spin up easily for any Apache project using similar tools.
>
> Kenn
>
> On Tue, May 21, 2019 at 1:02 PM Mikhail Gryzykhin 
> wrote:
>
>> Current http://104.154.241.245/robots.txt already disallows all, so we
>> are good here.
>>
>> On Tue, May 21, 2019 at 12:57 PM Ahmet Altay  wrote:
>>
>>> If SSL is a concern, that makes sense. I am not familiar enough with
>>> that to suggest whether another way to do this exists or not.
>>>
>>> It will be good to check that we can set robots.txt properly from the
>>> beginning if we go down this path.
>>>
>>> On Mon, May 20, 2019 at 10:54 AM Mikhail Gryzykhin 
>>> wrote:
>>>
 @Ahmet Altay 
 Thank you for the comment.

 The point on search engines is really good. If that happens, we can look
 into configuring robots.txt to tell search engines to ignore the whole
 domain.
 The link is a redirect to a static IP, so it is still confusing.

 Having a domain name will allow us to get an SSL cert associated with it
 and to keep the same address even if the IP changes (say, if we want to
 move to another hoster).

>>>
>>> I suppose a short link will also allow us to change the host, very
>>> similarly to a domain name. That is a minor point anyway.
>>>
>>>

 Given the two points above, I still consider that having an explicit name
 will be beneficial. If there's some other way to get an SSL cert and the
 benefit of a static name, I'm eager to use it.

 On Mon, May 20, 2019 at 10:43 AM Ahmet Altay  wrote:

> Hi Mikhail,
>
> Thank you for your work on this. I have some comments:
>
> - There is already a short link (
> https://s.apache.org/beam-community-metrics). Would a link from the
> Contribute to Beam page (if there is not one already) be sufficient? People
> can bookmark the short link if they need quick access.
> - Metrics is a developer-facing tool. If it has its own subdomain and
> starts showing up in web search results, it will be a confusing landing
> page
> for people simply searching for "beam metrics". I believe there is some
> value in having a single domain and linking to various things from there.
> This would be similar to how we link to jira, wiki, mailing list archives.
>
> Ahmet
>
> On Fri, May 17, 2019 at 9:26 PM Mikhail Gryzykhin <
> gryzykhin.mikh...@gmail.com> wrote:
>
>> @Aizamat
>> The code is not generalized and is project-specific in some places. But
>> it is small and pretty straightforward, so it can be ported easily. The
>> whole thing can be started locally with a single Docker command, so it's
>> easy to try it out.
>>
>> On Fri, May 17, 2019, 19:33 Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Hi Mikhail,
>>>
>>> I think this dashboard is amazing, and would love to have an easy
>>> access to it. So here is my non binding +1.
>>>
>>> On a side note, how easy is it to recreate for other Apache
>>> projects? ;)
>>>
>>> Thanks,
>>> Aizhamal
>>>
>>> *From: *Mikhail Gryzykhin 
>>> *Date: *Fri, May 17, 2019 at 6:49 PM
>>> *To: *dev
>>>
>>> Hello everyone,

 Some time ago we started the community metrics dashboard. However, we never
 added a permanent URL for it. This is really inconvenient to use, since the
 only available way to access the dashboard is by IP address.

 In this thread I'd like to:
 1. Vote to assign metrics.beam.apache.org to the metrics dashboard (
 http://104.154.241.245).
 2. Gather information on how to do it. I can assume only the following
 steps so far: a) vote; b) once the vote is complete, contact Apache INFRA
 to help with this.

 Regards,
 Mikhail.




Re: Slowly changing lookup cache as a Table in BeamSql

2019-07-18 Thread Pablo Estrada
I've added you as a contributor.

On Thu, Jul 18, 2019 at 7:52 AM rahul patwari 
wrote:

> Hi Rui,
>
> I have a query about BEAM-7758.
> If [Pattern: slowly changing lookup cache] is followed while defining and
> constructing the lookup table, and it is set with SqlTransform, then if any
> aggregation (JOIN) needs to be performed, say, between a windowed Kafka
> PCollection table and the lookup table, the aggregation cannot be done
> unless both PCollections have matching WindowFns, as they are unbounded.
> What can be done to treat the lookup table as a bounded PCollection and
> perform the aggregation with every window of Kafka's PCollection?
>
> Thanks,
> Rahul
>
>
> On Wed, Jul 17, 2019 at 1:06 AM Rui Wang  wrote:
>
>> Another approach is to let BeamSQL support it natively, as the title of
>> this thread says: "as a Table in BeamSQL".
>>
>> We might be able to define a table with properties that say this table
>> returns a PCollectionView. By doing so we will have a trigger-based
>> PCollectionView available in SQL rel nodes, thus SQL will be able to
>> implement [*Pattern: Slowly-changing lookup cache*]. This way, users
>> only need to construct a table and set it on SqlTransform.
>>
>> Created a JIRA to track this idea:
>> https://jira.apache.org/jira/browse/BEAM-7758
>>
>>
>> -Rui
>>
>>
>> On Tue, Jul 16, 2019 at 7:12 AM Reza Rokni  wrote:
>>
>>> Hi Rahul,
>>>
>>> FYI, that pattern is also available in the Beam docs (with an updated
>>> code example):
>>> https://beam.apache.org/documentation/patterns/side-input-patterns/.
>>>
>>> Please note that in the DoFn that feeds the View.asSingleton() you will
>>> need to call BigQuery manually, using the BigQuery client.
>>>
>>> Regards
>>>
>>> Reza
>>>
>>> On Tue, 16 Jul 2019 at 14:37, rahul patwari 
>>> wrote:
>>>
 Hi,

 we are following [*Pattern: Slowly-changing lookup cache*] from
 https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1

 We have a use case to read slowly changing bounded data as a
 PCollection along with the main PCollection from Kafka (windowed) and use
 it in a BeamSql query.

 Is it possible to design such a use case with Beam Java SDK?

 Approaches followed but not successful:

 1) GenerateSequence => GlobalWindow with Data Trigger => Composite
 Transform (which applies Beam I/O on the
 pipeline [PCollection.getPipeline()]) => Convert the resulting PCollection
 to PCollection<Row> => Apply BeamSQL
 Comments: Beam I/O reads data only once even though a long value is
 generated from GenerateSequence with periodicity. The expectation is that
 whenever a long value is generated, Beam I/O will be used to read the
 latest data. Is this because of optimizations in the DAG? Can the
 optimizations be overridden?

 2) The pipeline is the same as approach 1. But, instead of using a
 composite transform, a DoFn is used where a for loop will emit each Row of
 the PCollection.
 Comments: The output PCollection is unbounded. But we need a bounded
 PCollection, as this PCollection is used to JOIN with the PCollection of
 each window from Kafka. How can we convert an unbounded PCollection to a
 bounded PCollection inside a DoFn?

 Are there any better approaches?

 Regards,
 Rahul



>>>
>>> --
>>>
>>> This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>>
>>> The above terms reflect a potential business arrangement, are provided
>>> solely as a basis for further discussion, and are not intended to be and do
>>> not constitute a legally binding obligation. No legally binding obligations
>>> will be created, implied, or inferred until an agreement in final form is
>>> executed in writing by all parties involved.
>>>
>>
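
For reference, the side-input pattern discussed in this thread looks roughly
like the sketch below (Java, assuming a Pipeline p; fetchLookupRows() is a
hypothetical helper that queries BigQuery directly through the client library,
not a Beam IO):

=====
// imports: org.apache.beam.sdk.io.GenerateSequence, org.apache.beam.sdk.transforms.*,
//          org.apache.beam.sdk.transforms.windowing.*, org.apache.beam.sdk.values.*,
//          org.joda.time.Duration, java.util.Map
PCollectionView<Map<String, String>> lookup =
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        // Put the periodic ticks in the global window, firing on every
        // element and keeping only the latest pane.
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Hypothetical: a direct BigQuery client call, as Reza notes above.
            c.output(fetchLookupRows());
          }
        }))
        .apply(View.asSingleton());
=====

A DoFn over the windowed Kafka PCollection can then read the freshest value
via .withSideInputs(lookup), which is how the pattern sidesteps the
matching-WindowFn problem raised above: side inputs map each main-input
window onto the (global) side-input window.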


Re: [discuss] A tweak to the Python API for SDF?

2019-07-18 Thread Pablo Estrada
Jira for Java: https://jira.apache.org/jira/browse/BEAM-7250
Jira for Python: none :/ I just jumped right in to make the change:
https://github.com/apache/beam/pull/8430
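
For context, a rough sketch of the Java shape that BEAM-7250 would revisit: in
Java the restriction logic already hangs off the DoFn itself via method
annotations (the DoFn and its logic below are illustrative only):

=====
// imports: org.apache.beam.sdk.transforms.DoFn,
//          org.apache.beam.sdk.io.range.OffsetRange,
//          org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker
class EmitCharsFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c, OffsetRangeTracker tracker) {
    // Claim offsets one at a time; the runner may split off the remainder.
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
      c.output(String.valueOf(c.element().charAt((int) i)));
    }
  }

  @GetInitialRestriction
  public OffsetRange getInitialRestriction(String element) {
    return new OffsetRange(0, element.length());
  }

  @NewTracker
  public OffsetRangeTracker newTracker(OffsetRange restriction) {
    return new OffsetRangeTracker(restriction);
  }
}
=====

The Python change in the PR above moves the equivalent methods onto a
RestrictionProvider handed to the DoFn through a parameter, which is the
asymmetry the Java JIRA tracks.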

On Thu, Jul 18, 2019 at 1:59 AM Ismaël Mejía  wrote:

> Is there a JIRA already to track this?
>
> On Fri, Jun 14, 2019 at 11:52 PM Ismaël Mejía  wrote:
> >
> > +1 for the change for Java as well, both for consistency with Python and
> > with the way State/Timers work.
> >
> > It would be really nice, if possible, to see a concrete proposed example
> > (or, even better, a design doc).
> >
> > Thanks for bringing up this idea, Pablo, and sorry for the delayed answer.
> >
> > On Wed, Jun 5, 2019 at 8:44 PM Pablo Estrada  wrote:
> > >
> > > I have no objections.
> > >
> > > +Ismaël Mejía who has familiarity and interest in Java SDF.
> > >
> > > On Wed, Jun 5, 2019 at 11:31 AM Brian Hulette 
> wrote:
> > >>
> > >> Just wanted to resurrect this to say that it seems appropriate to
> make the same change in Java. All the same arguments apply there, and now
> there's the additional argument for maintaining symmetry with Python.
> > >>
> > >> I think BEAM-7250 should be changed to a ticket to actually implement
> this in Java unless someone has an objection.
> > >>
> > >> Brian
> > >>
> > >> On Wed, May 8, 2019 at 2:20 PM Pablo Estrada 
> wrote:
> > >>>
> > >>> Hello all,
> > >>> The API has been updated for Python (See
> https://github.com/apache/beam/pull/8430). Please, if you catch any
> documentation that needs updating, flag to me or just propose the change : )
> > >>>
> > >>> As for Java - we didn't end up determining whether it makes sense to
> update the API as well. Thoughts from others?
> > >>>
> > >>> In any case, I've filed
> https://jira.apache.org/jira/browse/BEAM-7250 to track this for Java.
> > >>>
> > >>> Best
> > >>> -P.
> > >>>
> > >>> On Mon, Apr 29, 2019 at 2:41 PM Lukasz Cwik 
> wrote:
> > >>>>
> > >>>> Pablo, all the SplittableDoFn stuff is marked as @Experimental so
> one is able to change it. There really is only one complicated one to
> change in Watch.java, the rest are quite straightforward.
> > >>>>
> > >>>> On Mon, Apr 29, 2019 at 2:23 PM Pablo Estrada 
> wrote:
> > >>>>>
> > >>>>> Thanks all,
> > >>>>>  @Luke - I imagine that would be an improvement to the API, but
> this may be harder as this is already available to users, and there are
> those who have implemented SDFs under the current API. Would it be possible
> to make a backwards-compatible change to the API here?
> > >>>>>
> > >>>>> For the Python changes, I've proposed a pull request:
> https://github.com/apache/beam/pull/8430 - it was smaller than I thought
> : ) - All comments welcome please.
> > >>>>>
> > >>>>> +Boyuan Zhang I am happy to wait for your SyntheticSource PR to be
> merged and make the appropriate changes if you'd like.
> > >>>>> Best
> > >>>>> -P.
> > >>>>>
> > >>>>> On Mon, Apr 29, 2019 at 8:23 AM Lukasz Cwik 
> wrote:
> > >>>>>>
> > >>>>>> Would it make sense to also do this in the Java SDK?
> > >>>>>>
> > >>>>>> That would make the restriction provider also mirror the TimerSpec
> > >>>>>> and StateSpec, which use annotations, similar to how it's done in Python.
> > >>>>>>
> > >>>>>> On Mon, Apr 29, 2019 at 3:42 AM Robert Bradshaw <
> rober...@google.com> wrote:
> > >>>>>>>
> > >>>>>>> +1 to introducing this Param for consistency (and making the
> > >>>>>>> substitution more obvious), and I think SDF is still
> new/experimental
> > >>>>>>> enough we can do this. I don't know if we need Spec in addition
> to
> > >>>>>>> Param and Provider.
> > >>>>>>>
> > >>>>>>> On Sat, Apr 27, 2019 at 1:07 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> > On Fri, Apr

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-20 Thread Pablo Estrada
Hello all,

This will be streamed on youtube on this link:
https://www.youtube.com/watch?v=xpIpEO4PUDo

I think there will be a live chat, so I will hopefully be available to
answer questions. To be honest, my workflow is not super efficient, but...
oh well, hopefully it will be at least somewhat helpful to others : )
Best
-P.

On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:

> +1, I'd love to see this as a recording. Will you stick it up on youtube
> afterwards?
>
> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
> wrote:
>
>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be recorded
>> as well.
>>
>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada  wrote:
>>
>>> Yes! So I will be working on a small feature request for Java's
>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>
>>> Maybe I'll do something for Python next month. : )
>>> Best
>>> -P.
>>>
>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>>> wrote:
>>>
>>>> +1, I really appreciate this initiative. It would be really helpful to
>>>> newbies like me.
>>>>
>>>> Is it possible to list out what are the things that you are planning to
>>>> cover?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang  wrote:
>>>>
>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>
>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour next
>>>>>> week. Everyone is invited.
>>>>>>
>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I
>>>>>> will finish a full change in that time frame, but we'll see.
>>>>>>
>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>> youtube livestream. In any case, I will share the link here in the next 
>>>>>> few
>>>>>> days.
>>>>>>
>>>>>> I will most likely work on the Java SDK (I have a little feature
>>>>>> request in mind).
>>>>>>
>>>>>> Thanks!
>>>>>> -P.
>>>>>>
>>>>>


On Auto-creating GCS buckets on behalf of users

2019-07-22 Thread Pablo Estrada
Hello all,
I recently worked on a transform to load data into BigQuery by writing
files to GCS, and issuing Load File jobs to BQ. I did this for the Python
SDK[1].

This option requires the user to provide a GCS bucket to write the files to:

   - If the user provides a bucket to the transform, the SDK will use that
   bucket.
   - If the user does not provide a bucket:
  - When running in Dataflow, the SDK will borrow the temp_location of
  the pipeline.
  - When running in other runners, the pipeline will fail.

The Java SDK has had functionality for File Loads into BQ for a long time.
In particular, when users do not provide a bucket, it attempts to create
a default bucket [2], and this bucket is used as the temp_location (which
is then used by the BQ File Loads transform).

I do not really like creating GCS buckets on behalf of users. In Java, the
outcome is that users will not have to pass a --tempLocation parameter when
submitting jobs to Dataflow - which is a nice convenience, but I'm not sure
that this is in line with users' expectations.

Currently, the options are:

   - Adding support for bucket autocreation for Python SDK
   - Deprecating support for bucket autocreation in Java SDK, and printing
   a warning.

I am personally inclined for #1. But what do others think?

Best
-P.

[1] https://github.com/apache/beam/pull/7892
[2]
https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
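
To make the described behavior concrete, the decision the Python transform
makes today boils down to something like the following (an illustrative Java
sketch with made-up names, not the actual SDK code):

=====
// Sketch of the bucket-resolution logic described above.
static String resolveGcsStagingLocation(
    @Nullable String userBucket, @Nullable String tempLocation, boolean onDataflow) {
  if (userBucket != null) {
    return userBucket;  // a user-provided bucket always wins
  }
  if (onDataflow && tempLocation != null && tempLocation.startsWith("gs://")) {
    return tempLocation;  // borrow the pipeline's temp_location
  }
  // Option 1 in this thread would auto-create a default bucket here instead;
  // option 2 would keep failing (and deprecate Java's auto-creation).
  throw new IllegalArgumentException(
      "BigQuery file loads need a GCS bucket: pass one to the transform, "
          + "or set a GCS temp_location the SDK can borrow.");
}
=====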


Re: Proposal: Add permanent url to community metrics dashboard

2019-07-22 Thread Pablo Estrada
Hi all,
I've filed https://issues.apache.org/jira/browse/INFRA-18786 for this.
Thanks!
-P.

On Thu, Jul 18, 2019 at 1:38 PM Mikhail Gryzykhin  wrote:

> +1 explicitly
>
> On Thu, Jul 18, 2019 at 1:46 AM Łukasz Gajowy 
> wrote:
>
>> +1 for pushing this forward. The url "metrics.beam.apache.org" looks
>> good to me and is generic enough - this is good in case we want to display
>> not only "community metrics" in Grafana but, e.g., IOIT or load test
>> results.
>>
>> Thanks!
>>
>> czw., 18 lip 2019 o 00:48 Mikhail Gryzykhin 
>> napisał(a):
>>
>>> Thank you Alan, that's an interesting link.
>>>
>>> The latest Grafana version in Docker is v6.2.5, so the issues on that
>>> list are not applicable. We should be fine on this front. We should
>>> update the container version of Grafana running on the service, though.
>>>
>>> @Pablo
>>> I feel it's best for the PMC to start the conversation with INFRA. I can
>>> follow up on it if you CC me.
>>>
>>> Regards,
>>> Mikhail.
>>>
>>>
>>> On Wed, Jul 17, 2019 at 2:46 PM Alan Myrvold 
>>> wrote:
>>>
>>>> Are all of the CVE issues fixed at the version in use?
>>>> https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=grafana
>>>> XSS isn't much of a concern until there is a hostname associated.
>>>>
>>>> On Wed, Jul 17, 2019 at 2:17 PM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> I'd like to move this forward. Mikhail, would you be interested in
>>>>> filing an issue with Infra to see if it's possible? I can do it if you
>>>>> prefer.
>>>>>
>>>>> It seems that the concerns related to these dashboards showing up in
>>>>> search results have been addressed. Does the community have any other
>>>>> concern around this before we can move it forward?
>>>>> Best
>>>>> -P.
>>>>>
>>>>> On Wed, May 22, 2019 at 8:53 AM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> I suggest asking infra about the best way to proceed, so that we
>>>>>> don't vote on something that doesn't work for them. This might be 
>>>>>> something
>>>>>> handy to spin up easily for any Apache project using similar tools.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Tue, May 21, 2019 at 1:02 PM Mikhail Gryzykhin 
>>>>>> wrote:
>>>>>>
>>>>>>> Current http://104.154.241.245/robots.txt already disallows all,
>>>>>>> so we are good here.
>>>>>>>
>>>>>>> On Tue, May 21, 2019 at 12:57 PM Ahmet Altay 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If SSL is a concern, that makes sense. I am not familiar enough
>>>>>>>> with that to suggest whether another way to do this exists or not.
>>>>>>>>
>>>>>>>> It will be good to check that we can set robots.txt properly from
>>>>>>>> the beginning if we go down this path.
>>>>>>>>
>>>>>>>> On Mon, May 20, 2019 at 10:54 AM Mikhail Gryzykhin <
>>>>>>>> mig...@google.com> wrote:
>>>>>>>>
>>>>>>>>> @Ahmet Altay 
>>>>>>>>> Thank you for the comment.
>>>>>>>>>
>>>>>>>>> The point on search engines is really good. If that happens we can
>>>>>>>>> look into configuring robots.txt to tell search engines to ignore
>>>>>>>>> the whole domain.
>>>>>>>>> The link is a redirect to a static IP, so it is still confusing.
>>>>>>>>>
>>>>>>>>> Having a domain name will allow us to get an SSL cert associated
>>>>>>>>> with it and to keep the same address even if the IP changes (say,
>>>>>>>>> if we want to move to another hoster).
>>>>>>>>>
>>>>>>>>
>>>>>>>> I suppose a short link will also allow us to change the host, very
>>>>>>>> similarly to a domain name. That is a minor point anyway.
>>>>>>>>
>>>>>>

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-23 Thread Pablo Estrada
Hi all,
I've just realized that https://issues.apache.org/jira/browse/BEAM-7607 is
a single-line change - and we'd spend 40 minutes chitchatting, so I'll also
be working on https://jira.apache.org/jira/browse/BEAM-7803, which is a
Python issue (also for the BigQuery sink!).
Thanks!
-P.

On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada  wrote:

> Hello all,
>
> This will be streamed on youtube on this link:
> https://www.youtube.com/watch?v=xpIpEO4PUDo
>
> I think there will be a live chat, so I will hopefully be available to
> answer questions. To be honest, my workflow is not super efficient, but...
> oh well, hopefully it will be at least somewhat helpful to others : )
> Best
> -P.
>
> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>
>> +1, I'd love to see this as a recording. Will you stick it up on youtube
>> afterwards?
>>
>> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
>> wrote:
>>
>>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be
>>> recorded as well.
>>>
>>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Yes! So I will be working on a small feature request for Java's
>>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>>
>>>> Maybe I'll do something for Python next month. : )
>>>> Best
>>>> -P.
>>>>
>>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>>>> wrote:
>>>>
>>>>> +1, I really appreciate this initiative. It would be really helpful to
>>>>> newbies like me.
>>>>>
>>>>> Is it possible to list out what are the things that you are planning
>>>>> to cover?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>>>> wrote:
>>>>>
>>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>>
>>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour next
>>>>>>> week. Everyone is invited.
>>>>>>>
>>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I
>>>>>>> will finish a full change in that time frame, but we'll see.
>>>>>>>
>>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>>> youtube livestream. In any case, I will share the link here in the next 
>>>>>>> few
>>>>>>> days.
>>>>>>>
>>>>>>> I will most likely work on the Java SDK (I have a little feature
>>>>>>> request in mind).
>>>>>>>
>>>>>>> Thanks!
>>>>>>> -P.
>>>>>>>
>>>>>>


Re: Choosing a coder for a class that contains a Row?

2019-07-23 Thread Pablo Estrada
+dev 
Thanks Ryan! This is quite helpful. Still not what I need : ) - but useful.

The data is change data capture from databases, and I'm putting it into a
Beam Row. The schema for the Row is generally homogeneous, but subject to
change at some point in the future if the schema in the database changes.
It's unusual and unlikely, but possible. I have no idea how Beam deals with
evolving schemas. +Reuven Lax  is there documentation /
examples / anything around this? : )

I think evolving schemas is an interesting question.

For now, I am going to Java-serialize the objects, and delay figuring this
out. But I reckon I'll have to come back to this...

Best
-P.

On Tue, Jul 23, 2019 at 1:07 AM Ryan Skraba  wrote:

> Hello Pablo!  Just to clarify -- the Row schemas aren't known at
> pipeline construction time, but can be discovered from the instance of
> MyData?
>
> Once discovered, is the schema "homogeneous" for all instance of
> MyData?  (i.e. someRow will always have the same schema for all
> instances afterwards, and there won't be another someRow with a
> different schema).
>
> We've encountered a parallel "problem" with pure Avro data, where the
> instance is a GenericRecord containing its own Avro schema but
> *without* knowing the schema until the pipeline is run.  The solution
> that we've been using is a bit hacky, but we're using an ad hoc
> per-job schema registry and a custom coder where each worker saves the
> schema in the `encode` before writing the record, and loads it lazily
> in the `decode` before reading.
>
> The original code is available[1] (be gentle, it was written with Beam
> 0.4.0-incubating... and has continued to work until now).
>
> In practice, the ad hoc schema registry is just a server socket in the
> Spark driver, in-memory for DirectRunner / local mode, and a
> read/write to a known location in other runners.  There are definitely
> other solutions with side-inputs and providers, and the job server in
> portability looks like an exciting candidate for per-job schema
> registry story...
>
> I'm super eager to see if there are other ideas or a contribution we
> can make in this area that's "Beam Row" oriented!
>
> Ryan
>
> [1]
> https://github.com/Talend/components/blob/master/core/components-adapter-beam/src/main/java/org/talend/components/adapter/beam/coders/LazyAvroCoder.java
>
> On Tue, Jul 23, 2019 at 12:49 AM Pablo Estrada  wrote:
> >
> > Hello all,
> > I am writing a utility to push data to PubSub. My data class looks
> something like so:
> > ==
> > class MyData {
> >   String someId;
> >   Row someRow;
> >   Row someOtherRow;
> > }
> > ==
> > The schema for the Rows is not known a priori. It is contained by the
> Row. I am then pushing this data to pubsub:
> > ===
> > MyData pushingData = 
> > WhatCoder? coder = 
> >
> > ByteArrayOutputStream os = new ByteArrayOutputStream();
> > coder.encode(pushingData, os);
> >
> > pubsubClient.connect();
> >
> pubsubClient.push(PubSubMessage.newBuilder().setData(os.toByteArray()).build());
> > pubsubClient.close();
> > =
> > What's the right coder to use in this case? I don't know if SchemaCoder
> will work, because it seems that it requires the Row's schema a priori. I
> have not been able to make AvroCoder work.
> >
> > Any tips?
> > Best
> > -P.
>
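
A Row-oriented variant of Ryan's LazyAvroCoder idea is to make the coder
self-describing: write the Row's schema next to every encoded element, so no
schema is needed at construction time. A rough sketch (it assumes Beam's
Schema is Java-serializable, pays a size/CPU cost per element, and a
per-schema cache of RowCoders would soften that):

=====
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.util.SerializableUtils;
import org.apache.beam.sdk.values.Row;

// Each element is encoded as: schema length, schema bytes, then the Row
// itself encoded with a RowCoder built from that schema.
class SelfDescribingRowCoder extends CustomCoder<Row> {
  @Override
  public void encode(Row value, OutputStream out) throws IOException {
    byte[] schemaBytes = SerializableUtils.serializeToByteArray(value.getSchema());
    DataOutputStream dataOut = new DataOutputStream(out);
    dataOut.writeInt(schemaBytes.length);
    dataOut.write(schemaBytes);
    RowCoder.of(value.getSchema()).encode(value, dataOut);
  }

  @Override
  public Row decode(InputStream in) throws IOException {
    DataInputStream dataIn = new DataInputStream(in);
    byte[] schemaBytes = new byte[dataIn.readInt()];
    dataIn.readFully(schemaBytes);
    Schema schema =
        (Schema) SerializableUtils.deserializeFromByteArray(schemaBytes, "row schema");
    return RowCoder.of(schema).decode(dataIn);
  }
}
=====

A coder for MyData could compose this with StringUtf8Coder for the id field.
It does not solve schema evolution by itself, but it spares readers from
needing the schema a priori.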


Re: Python Beam pipelines on Flink on Kubernetes

2019-07-24 Thread Pablo Estrada
I am very happy to see this. I'll take a look, and leave my comments.

I think this is something we'd been needing, and it's great that you guys
are putting thought into it. Thanks!<3

On Wed, Jul 24, 2019 at 9:01 PM Thomas Weise  wrote:

> Hi,
>
> Recently Lyft open sourced *FlinkK8sOperator,* a Kubernetes operator to
> manage Flink deployments on Kubernetes:
>
> https://github.com/lyft/flinkk8soperator/
>
> We are now discussing how to extend this operator to also support
> deployment of Python Beam pipelines with the Flink runner. I would like to
> share the proposal with the Beam community to enlist feedback as well as
> explore opportunities for collaboration:
>
>
> https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/
>
> Looking forward to your comments and suggestions!
>
> Thomas
>
>


Re: [2.14.0] Release Progress Update

2019-07-25 Thread Pablo Estrada
Hi Anton,
are there updates on the release?
Thanks!
-P.

On Fri, Jul 19, 2019 at 12:33 PM Anton Kedin  wrote:

> Verification build succeeds except for AWS IO (which has tests hanging). I
> will continue the release process as normal and will investigate the AWS IO
> issue meanwhile. I will either disable the hanging tests to get the artifacts
> for an RC, or continue without it temporarily; it will need to be re-validated
> when the issue is resolved.
>
> Regards,
> Anton
>
> On Thu, Jul 18, 2019 at 8:54 AM Anton Kedin  wrote:
>
>> All cherry-picks are merged, blocker jiras closed, running the
>> verification build.
>>
>> On Mon, Jul 15, 2019 at 4:53 PM Ahmet Altay  wrote:
>>
>>> Anton, any updates on this release? Do you need help?
>>>
>>> On Fri, Jun 28, 2019 at 11:42 AM Anton Kedin  wrote:
>>>
 I have been running validation builds (had some hiccups with that);
 everything looks mostly good, except failures in `:beam-test-tools` and
 `:io:aws`. Now I will start cherry-picking other fixes and trying to figure
 out the specific issues.

 Regards,
 Anton

 On Fri, Jun 21, 2019 at 3:17 PM Anton Kedin  wrote:

> Not much progress today. Debugging build issues when running global
> `./gradlew build -PisRelease --scan`
>
> Regards,
> Anton
>
> On Thu, Jun 20, 2019 at 4:12 PM Anton Kedin  wrote:
>
>> Published the snapshots, working through the
>> verify_release_validation script
>>
>> Got another blocker to be cherry-picked when merged:
>> https://issues.apache.org/jira/browse/BEAM-7603
>>
>> Regards,
>> Anton
>>
>>
>> On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin  wrote:
>>
>>> I have cut the release branch for 2.14.0 and working through the
>>> release process. Next step is building the snapshot and release branch
>>> verification.
>>>
>>> There are two issues [1] that are still not resolved that are marked
>>> as blockers at the moment:
>>>  * [2] BEAM-7478 - remote cluster submission from Flink Runner
>>> broken;
>>>  * [3] BEAM-7424 - retries for GCS;
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
>>> [2] https://issues.apache.org/jira/browse/BEAM-7478
>>> [3] https://issues.apache.org/jira/browse/BEAM-7424
>>>
>>> Regards,
>>> Anton
>>>
>>


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo
This is still happening.

On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack 
wrote:

> Did I miss the link or this was postponed?
>
> On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Pablo,
>>
>> Assigned  https://issues.apache.org/jira/browse/BEAM-7607 to you, to
>> make even more likely that it is still around on the 25th :-)
>>
>> Cheers,
>> Austin
>>
>> On Tue, Jul 23, 2019 at 11:24 AM Pablo Estrada 
>> wrote:
>>
>>> Hi all,
>>> I've just realized that https://issues.apache.org/jira/browse/BEAM-7607 is
>>> a single-line change - and we'd spend 40 minutes chitchatting, so I'll also
>>> be working on https://jira.apache.org/jira/browse/BEAM-7803, which is a
>>> Python issue (also for the BigQuery sink!).
>>> Thanks!
>>> -P.
>>>
>>> On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> This will be streamed on youtube on this link:
>>>> https://www.youtube.com/watch?v=xpIpEO4PUDo
>>>>
>>>> I think there will be a live chat, so I will hopefully be available to
>>>> answer questions. To be honest, my workflow is not super efficient, but...
>>>> oh well, hopefully it will be at least somewhat helpful to others : )
>>>> Best
>>>> -P.
>>>>
>>>> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>>>>
>>>>> +1, I'd love to see this as a recording. Will you stick it up on
>>>>> youtube afterwards?
>>>>>
>>>>> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
>>>>> wrote:
>>>>>
>>>>>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be
>>>>>> recorded as well.
>>>>>>
>>>>>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
>>>>>> wrote:
>>>>>>
>>>>>>> Yes! So I will be working on a small feature request for Java's
>>>>>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>>>>>
>>>>>>> Maybe I'll do something for Python next month. : )
>>>>>>> Best
>>>>>>> -P.
>>>>>>>
>>>>>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1, I really appreciate this initiative. It would be really helpful
>>>>>>>> to newbies like me.
>>>>>>>>
>>>>>>>> Is it possible to list out what are the things that you are
>>>>>>>> planning to cover?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>>>>>
>>>>>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour
>>>>>>>>>> next week. Everyone is invited.
>>>>>>>>>>
>>>>>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I
>>>>>>>>>> will finish a full change in that time frame, but we'll see.
>>>>>>>>>>
>>>>>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>>>>>> youtube livestream. In any case, I will share the link here in the 
>>>>>>>>>> next few
>>>>>>>>>> days.
>>>>>>>>>>
>>>>>>>>>> I will most likely work on the Java SDK (I have a little feature
>>>>>>>>>> request in mind).
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> -P.
>>>>>>>>>>
>>>>>>>>>
>
> --
>
> *DJIOFACK INNOCENT*
> *"Be better than the day before!" -*
> *+1 404 751 8024*
>


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
Thanks to those who tuned in : ) - I feel like I might have spent too long
fiddling with Python code, and not long enough doing setup, testing, etc. I
will try to do another one where I just test / setup the environment / lint
checks etc.

Here are links for:
Setting up the Python environment: https://youtu.be/xpIpEO4PUDo?t=334
Quickly setting up the Java environment: https://youtu.be/xpIpEO4PUDo?t=3659

Doing a Pull Request: https://youtu.be/xpIpEO4PUDo?t=3770

On Thu, Jul 25, 2019 at 4:39 PM sridhar inuog 
wrote:

> Thanks, Pablo for organizing this session. I found it useful.
>
> On Thu, Jul 25, 2019 at 4:56 PM Pablo Estrada  wrote:
>
>> The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo
>> This is still happening.
>>
>> On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack 
>> wrote:
>>
>>> Did I miss the link or this was postponed?
>>>
>>> On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
>>>> Pablo,
>>>>
>>>> Assigned  https://issues.apache.org/jira/browse/BEAM-7607 to you, to
>>>> make even more likely that it is still around on the 25th :-)
>>>>
>>>> Cheers,
>>>> Austin
>>>>
>>>> On Tue, Jul 23, 2019 at 11:24 AM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> I've just realized that
>>>>> https://issues.apache.org/jira/browse/BEAM-7607 is a single-line
>>>>> change - and we'd spend 40 minutes chitchatting, so I'll also be working 
>>>>> on
>>>>> https://jira.apache.org/jira/browse/BEAM-7803, which is a Python
>>>>> issue (also for the BigQuery sink!).
>>>>> Thanks!
>>>>> -P.
>>>>>
>>>>> On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> This will be streamed on youtube on this link:
>>>>>> https://www.youtube.com/watch?v=xpIpEO4PUDo
>>>>>>
>>>>>> I think there will be a live chat, so I will hopefully be available
>>>>>> to answer questions. To be honest, my workflow is not super efficient,
>>>>>> but... oh well, hopefully it will be at least somewhat helpful to others 
>>>>>> : )
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>>>>>>
>>>>>>> +1, I'd love to see this as a recording. Will you stick it up on
>>>>>>> youtube afterwards?
>>>>>>>
>>>>>>> On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog <
>>>>>>> sridharin...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Pablo! Looking forward to it! Hopefully, it will also be
>>>>>>>> recorded as well.
>>>>>>>>
>>>>>>>> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes! So I will be working on a small feature request for Java's
>>>>>>>>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>>>>>>>>
>>>>>>>>> Maybe I'll do something for Python next month. : )
>>>>>>>>> Best
>>>>>>>>> -P.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar <
>>>>>>>>> rakeshku...@lyft.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1, I really appreciate this initiative. It would be really
>>>>>>>>>> helpful to newbies like me.
>>>>>>>>>>
>>>>>>>>>> Is it possible to list out what are the things that you are
>>>>>>>>>> planning to cover?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for organizing this Pablo, it'll be very helpful!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada <
>>>>>>>>>>> pabl...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello all,
>>>>>>>>>>>> I'll be having a session where I live-fix a Beam bug for 1 hour
>>>>>>>>>>>> next week. Everyone is invited.
>>>>>>>>>>>>
>>>>>>>>>>>> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully
>>>>>>>>>>>> I will finish a full change in that time frame, but we'll see.
>>>>>>>>>>>>
>>>>>>>>>>>> I have not yet decided if I will do this via hangouts, or via a
>>>>>>>>>>>> youtube livestream. In any case, I will share the link here in the 
>>>>>>>>>>>> next few
>>>>>>>>>>>> days.
>>>>>>>>>>>>
>>>>>>>>>>>> I will most likely work on the Java SDK (I have a little
>>>>>>>>>>>> feature request in mind).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> -P.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>>
>>> *DJIOFACK INNOCENT*
>>> *"Be better than the day before!" -*
>>> *+1 404 751 8024*
>>>
>>


Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-30 Thread Pablo Estrada
+1

I installed from source, and ran unit tests for Python in 2.7, 3.5, 3.6.

Also ran a number of integration tests on Py 3.5 on Dataflow and
DirectRunner.
Best
-P.

On Tue, Jul 30, 2019 at 11:09 AM Hannah Jiang 
wrote:

> I checked Py3 tests using .zip, mainly with direct runners, and everything
> looks good, so +1.
>
> On Tue, Jul 30, 2019 at 2:08 AM Robert Bradshaw 
> wrote:
>
>> I checked all the artifact signatures and ran a couple test pipelines
>> with the wheels (Py2 and Py3) and everything looked good to me, so +1.
>>
>> On Mon, Jul 29, 2019 at 8:29 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> I have checked Python 3 batch and streaming quickstarts on Dataflow
>>> runner using .zip and wheel distributions. So far +1 from me.
>>>
>>> On Mon, Jul 29, 2019 at 7:53 PM Ahmet Altay  wrote:
>>>
 +1, validated python 2 quickstarts.

 On Fri, Jul 26, 2019 at 5:46 PM Ahmet Altay  wrote:

> To confirm, I manually validated leaderboard on python. It is working.
>
> On Fri, Jul 26, 2019 at 5:23 PM Yifan Zou  wrote:
>
>> AFAIK, there should not be any special prerequisites for this. The things
>> the script does include:
>> 1. download the python rc in zip
>> 2. start virtualenv and install the sdk.
>> 3. verify hash.
>> 4. config settings.xml and start a Java pubsub message injector.
>> 5. run game examples and validate.
>>
>> Could you double-check if the sdk was installed properly (steps 1 & 2)?
>>
>
> I am also guessing this is the case. Probably something earlier in the
> validation script did not run as expected.
>
>
>>
>>
> Yifan
>>
>> On Fri, Jul 26, 2019 at 2:38 PM Anton Kedin  wrote:
>>
>>> Validation script fails for me when I try to run [1] python
>>> leaderboard with direct runner:
>>>
>>> ```
>>> *
>>> * Running Python Leaderboard with DirectRunner
>>> *
>>> /usr/bin/python: No module named apache_beam.examples.complete.game
>>> ```
>>>
>>> If someone has more context, what are the prerequisites for this
>>> step? How does it look up the module?
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/release/src/main/scripts/run_rc_validation.sh#L424
>>>
>>> Regards,
>>> Anton
>>>
>>> On Fri, Jul 26, 2019 at 10:23 AM Anton Kedin 
>>> wrote:
>>>
 Cool, will make the post and will update the release guide as well
 then

 On Fri, Jul 26, 2019 at 10:20 AM Chad Dombrova 
 wrote:

> I think the release guide needs to be updated to remove the
>> optionality of blog creation and avoid confusion. Thanks for 
>> pointing that
>> out.
>>
>
> +1
>
>

