Re: Best patterns for a polling transform

2023-06-22 Thread Chad Dombrova
I’m also interested in the answer to this.  This is essential for reading
from many types of data sources.


On Tue, Jun 20, 2023 at 2:57 PM Sam Bourne  wrote:

> +dev to see if anyone has any suggestions.
>
> On Fri, Jun 16, 2023 at 5:46 PM Sam Bourne  wrote:
>
>> Hello beam community!
>>
>> I’m having trouble coming up with the best pattern to *eagerly* poll. By
>> eagerly, I mean that elements should be consumed and yielded as soon as
>> possible. There are a handful of experiments that I’ve tried and my latest
>> attempt using the timer API seems quite promising, but is operating in a
>> way that I find rather unintuitive. My solution was to create a sort of
>> recursive timer callback - which I found one example of within the beam
>> test code.
>>
>> I have a few questions:
>>
>> 1) The below code runs fine with a single worker, but with multiple workers
>> there are duplicate values. It seems that the callback and a snapshot of the
>> state are provided to multiple workers, and the number of duplications
>> increases with the number of workers. Is this due to the values being
>> provided to timer.set?
>>
>> 2) I’m using TimeDomain.WATERMARK here due to it simply not working when
>> using REAL_TIME. The docs seem to suggest REAL_TIME would be the way to do
>> this, however there seems to be no guarantee that a REAL_TIME callback will
>> run. In this sample, setting the timer to REAL_TIME will simply never fire
>> the callback. Interestingly, if you call timer.set with any value less than
>> the current time.time(), then the callback will run, however it seems to
>> fire immediately regardless of the value (and in this sample will actually
>> raise an AssertionError).
>>
>> I’m happy for suggestions!
>> -Sam
>>
>> import random
>> import threading
>>
>> import apache_beam as beam
>> import apache_beam.coders as coders
>> import apache_beam.transforms.combiners as combiners
>> import apache_beam.transforms.userstate as userstate
>> import apache_beam.utils.timestamp as timestamp
>> from apache_beam.options.pipeline_options import PipelineOptions
>>
>>
>> class Log(beam.PTransform):
>>
>>     lock = threading.Lock()
>>
>>     @classmethod
>>     def _log(cls, element, label):
>>         with cls.lock:
>>             # This just colors the print in terminal
>>             print('\033[1m\033[92m{}\033[0m : {!r}'.format(label, element))
>>         return element
>>
>>     def expand(self, pcoll):
>>         return pcoll | beam.Map(self._log, self.label)
>>
>>
>> class EagerProcess(beam.DoFn):
>>
>>     BUFFER_STATE = userstate.BagStateSpec('buffer', coders.PickleCoder())
>>     POLL_TIMER = userstate.TimerSpec('timer', beam.TimeDomain.WATERMARK)
>>
>>     def process(
>>         self,
>>         element,
>>         buffer=beam.DoFn.StateParam(BUFFER_STATE),
>>         timer=beam.DoFn.TimerParam(POLL_TIMER),
>>     ):
>>         _, item = element
>>
>>         for i in range(item):
>>             buffer.add(i)
>>
>>         timer.set(timestamp.Timestamp.now() + timestamp.Duration(seconds=10))
>>
>>     @userstate.on_timer(POLL_TIMER)
>>     def flush(
>>         self,
>>         buffer=beam.DoFn.StateParam(BUFFER_STATE),
>>         timer=beam.DoFn.TimerParam(POLL_TIMER),
>>     ):
>>         cache = buffer.read()
>>         buffer.clear()
>>
>>         requeue = False
>>         for item in cache:
>>             if random.random() < 0.1:
>>                 yield item
>>             else:
>>                 buffer.add(item)
>>                 requeue = True
>>
>>         if requeue:
>>             timer.set(timestamp.Timestamp.now() + timestamp.Duration(seconds=10))
>>
>>
>> def main():
>>     options = PipelineOptions.from_dictionary({
>>         'direct_num_workers': 3,
>>         'direct_running_mode': 'multi_threading',
>>     })
>>
>>     pipe = beam.Pipeline(options=options)
>>     (
>>         pipe
>>         | beam.Create([10])
>>         | 'Init' >> Log()
>>         | beam.Reify.Timestamp()
>>         | 'PairWithKey' >> beam.Map(lambda x: (hash(x), x))
>>         | beam.ParDo(EagerProcess())
>>         | 'Complete' >> Log()
>>         | beam.transforms.combiners.Count.Globally()
>>         | 'Count' >> Log()
>>     )
>>     result = pipe.run()
>>     result.wait_until_finish()
>>
>>
>> if __name__ == '__main__':
>>     main()
>>
>>


Re: GSoC idea: mypyc as an alternative to cython

2022-05-25 Thread Chad Dombrova
>
> - What does the new prototype code look like (hopefully much cleaner)?
>

Instead of a separate pxd file, you just have the existing .py file with
standard typing annotations.


> - How does performance compare to the Cython approach?
>

Good question.  I've not been able to find any posts with comparisons.
mypyc maintains a benchmark repo with results compared to standard
cpython:
https://github.com/mypyc/mypyc-benchmark-results/blob/master/reports/summary-main.md

Running these benchmarks against cython could be a good first task.

Unlike Cython, mypyc doesn’t natively support numpy, but IIRC beam is not
using that in its cythonized modules.

-chad


GSoC idea: mypyc as an alternative to cython

2022-02-11 Thread Chad Dombrova
Hi all,
At work, I recently started playing around with mypyc[1] as a means to
compile our python code to C extensions, and I'm pretty impressed so far.

Pros

   - write normal python code with annotations:  we're already doing this!
   - no need for cython-specific header files that can get out of sync with
   the pure python version
   - support for dataclasses and Generics
   - one less tool in the toolchain: mypyc is part of mypy
   - opens the door to more easily converting additional modules in the
   future

Cons

   - less mature than Cython
   - build errors are not very informative

Neutral

   - requires more detailed annotations.  for example, you must annotate
   class attributes with ClassVar (a small sketch follows below)
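
A small sketch of that last point (purely illustrative, not Beam code): mypyc
treats class-level assignments as instance attribute declarations unless they
are explicitly marked as class variables, so code like this needs ClassVar:

from typing import ClassVar

class Counter:
    # without the ClassVar annotation, mypyc treats this as an instance attribute
    instances: ClassVar[int] = 0

    def __init__(self, value: int) -> None:
        self.value = value
        Counter.instances += 1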

I thought it would be an interesting and relatively accessible project to
try to convert the current modules that use cython over to mypyc and see
how it goes.  Just a thought: take it or leave it!

-chad

[1] https://mypyc.readthedocs.io/en/latest/introduction.html


Re: [PROPOSAL] Batched DoFns in the Python SDK

2022-01-19 Thread Chad Dombrova
>
> Thanks Chad I'll take a look at your talk and design to see if there's any
> ideas we can merge.
>

Thanks Brian.  My hope is that even if you don't add the complete
scheduling framework, we'll get all the features and hooks we need to build
our toolset without needing to modify beam code (which is technical debt
that I'd rather not have).  Then we can offer our tool on PyPI for those
who want it.  I'm happy to have a call with you and the beam team to
discuss the details once you've had a look.

-chad


Re: [PROPOSAL] Batched DoFns in the Python SDK

2021-12-17 Thread Chad Dombrova
Hi Brian,
We implemented a feature that's similar to this, but with a different
motivation: scheduled tasks.  We had the same need of creating batches of
logical elements, but rather than perform SIMD-optimized computations, we
want to produce remotely scheduled tasks.  It's my hope that the underlying
job of slicing a stream into batches of elements can achieve both goals.
We've been using this change in production for awhile now, but I would
really love to get this onto master.  Can you have a look and see how it
compares to what you have in mind?

Here's the info I sent out about this earlier:

Beam's niche is low latency, high throughput workloads, but Beam has
incredible promise as an orchestrator of long running work that gets sent
to a scheduler.  We've created a modified version of Beam that allows the
python SDK worker to outsource tasks to a scheduler, like Kubernetes batch
jobs[1], Argo[2], or Google's own OpenCue[3].

The basic idea is that any element in a stream can be tagged to be executed
outside of the normal SdkWorker as an atomic "task".  A task is one
invocation of a stage, composed of one or more DoFns, against a slice of
the data stream, composed of one or more tagged elements.   The upshot is
that we're able to slice up the processing of a stream across potentially
*many* workers, with the trade-off being the added overhead of starting up
a worker process for each task.

For more info on how we use our modified version of Beam to make visual
effects for feature films, check out the talk[4] I gave at the Beam Summit.

Here's our design doc:
https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing

And here's the github branch:
https://github.com/LumaPictures/beam/tree/taskworker_public


[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
[2] https://argoproj.github.io/
[3] https://cloud.google.com/opencue
[4] https://www.youtube.com/watch?v=gvbQI3I03a8&ab_channel=ApacheBeam

-chad


On Wed, Dec 15, 2021 at 9:59 AM Brian Hulette  wrote:

> Hi all,
>
> I've drafted a proposal to add a "Batched DoFn" concept to the Beam Python
> SDK. The primary motivation is to make it more natural to draft vectorized
> Beam pipelines by allowing DoFns to produce and/or consume batches of
> logical elements. You can find the proposal here:
>
> https://s.apache.org/batched-dofns
>
> Please take a look and let me know what you think.
>
> Thanks!
> Brian
>


Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Chad Dombrova
Hi Brian,
I think the main goal would be to make a python package that could be pip
installed independently of apache_beam.  That goal could be accomplished
with option 3, thus preserving all of the benefits of a monorepo. If it
gains enough popularity and contributors outside of the Beam community,
then options 1 and 2 could be considered to make it easier to foster a new
community of contributors.

Beam has a lot of great tech in it, and it makes me think of Celery, which
is a much older python project of a similar ilk that spawned a series of
useful independent projects: kombu [1], an AMQP messaging library, and
billiard [2], a multiprocessing library.

Obviously, there are a number of pros and cons to consider.  The cons are
pretty clear: even within a monorepo it will make the Beam build more
complicated.  The pros are a bit more abstract.  The fileIO project could
appeal to a broader audience, and act as a signpost for Beam (on PyPI,
etc), thereby increasing awareness of Beam amongst the types of
cloud-friendly python developers who would need the fileIO package.

-chad

[1] https://github.com/celery/kombu
[2] https://github.com/celery/billiard




On Thu, May 20, 2021 at 7:57 AM Brian Hulette  wrote:

> That's an interesting idea. What do you mean by its own project? A couple
> of possibilities:
> - Spinning off a new ASF project
> - A separate Beam-governed repository (e.g. apache/beam-filesystems)
> - More clearly separate it in the current build system and release
> artifacts that allow it to be used independently
>
> Personally I'd be resistant to the first two (I am a Google engineer and I
> like monorepos after all), but I don't see a major problem with the last
> one, except that it gives us another surface to maintain.
>
> Brian
>
> On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:
>
>> This is a random idea, but the whole file IO system inside Beam would
>> actually be awesome to extract into its own project.  IIRC, it’s not
>> particularly tied to Beam.
>>
>> I’m not saying this should be done now, but it’d be nice to keep it in mind
>> for a future goal.
>>
>> -chad
>>
>>
>>
>> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada 
>> wrote:
>>
>>> That would be great to add, Matt. Of course it's important to make this
>>> backwards compatible, but other than that, the addition would be very
>>> welcome.
>>>
>>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>>> whether there’s general support for this idea before fleshing it out
>>>> further, getting internal approvals, etc.
>>>>
>>>>
>>>>
>>>> I’m working with multiple storage systems that speak the S3 api. I
>>>> would like to support FileIO operations for these storage systems, but
>>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>>> schemes) and it is in any case impossible to instantiate more than one in
>>>> the current design.
>>>>
>>>>
>>>>
>>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>>> the details yet, but it will take some thought to make this work in a
>>>> non-hacky way.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Matt Rudary
>>>>
>>>


Re: Proposal: Generalize S3FileSystem

2021-05-19 Thread Chad Dombrova
This is a random idea, but the whole file IO system inside Beam would
actually be awesome to extract into its own project.  IIRC, it’s not
particularly tied to Beam.

I’m not saying this should be done now, but it’d be nice to keep it in mind
for a future goal.

-chad



On Wed, May 19, 2021 at 10:23 AM Pablo Estrada  wrote:

> That would be great to add, Matt. Of course it's important to make this
> backwards compatible, but other than that, the addition would be very
> welcome.
>
> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
> wrote:
>
>> Hi,
>>
>>
>>
>> This is a quick sketch of a proposal – I wanted to get a sense of whether
>> there’s general support for this idea before fleshing it out further,
>> getting internal approvals, etc.
>>
>>
>>
>> I’m working with multiple storage systems that speak the S3 api. I would
>> like to support FileIO operations for these storage systems, but
>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>> schemes) and it is in any case impossible to instantiate more than one in
>> the current design.
>>
>>
>>
>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe
>> …aws.options) somewhat to enable this use-case. I haven’t worked out the
>> details yet, but it will take some thought to make this work in a non-hacky
>> way.
>>
>>
>>
>> Thanks
>>
>> Matt Rudary
>>
>


Re: Environment options for external transforms

2021-02-06 Thread Chad Dombrova
Hi,

First of all, this is an area that could use a lot of help, so thank you
Kyle for digging through the trove of tickets to understand all of the user
stories.


> I should have led with this. Someone wanted to mount credentials into
> the SDK harness [1]. So in this particular case the user just wants to
> mount files into their SDK harness, which is a pretty common use case, so
> resource hints are probably a more appropriate solution.
>
> [1]
> https://lists.apache.org/thread.html/r690094f1c9ebc4e1d20f029a21ba8bc846672a65baafd57c4f52cb94%40%3Cuser.beam.apache.org%3E
>

 Ah, that clarifies things. Would it be possible/preferable to pass the
 credentials as parameters to the transform itself?

>>>
> This is very useful for testing as well. For example, to test containers
> generated for release candidates.
>

We have been using custom containers since day 1 to address a number of
issues.  Speaking in the abstract, a UDF might rely on system libraries or
executables — which in turn rely on a specific linux OS —  that do not
exist on the standard container.  A more esoteric example is the need to
run a different python interpreter than what is provided in the container.
In our specific case we needed to use python interpreters that come bundled
with the applications that we use to process data, many of which are only
supported on CentOS.  We also had to modify Beam to allow passing docker
options for mounting data volumes.

-chad









>>> From a design standpoint, I feel find-replace is hacky and backwards. It's
>>> cleaner to specify what kind of environment we want directly in
>>> the ExpansionRequest. That way all of the environment creation logic
>>> belongs inside the expansion service.
>>>
>>
>> While Environments logically belong with Transforms, it is the
>> expansion service's job to attach the right environments to the 
>> transforms
>> that it vends. The caller should not need any visibility into the
>> environment(s) that an expansion service uses, which is an implementation
>> detail that the expansion service is free to change at any time. (In 
>> fact,
>> whether it is (partially or fully) implemented as an external transform 
>> is
>> an implementation detail that the end user should not need to care about 
>> or
>> depend on.)
>>
>> I personally think pattern matching and substitution by runners
>> (maybe more sophisticated than regexp on container names) is a reasonable
>> way to approach customization of environments. For example, suppose I
>> construct a pipeline that uses both Python and Java transforms. (I could 
>> do
>> this from Go, Java, or Python). If I want to run this locally (e.g. on 
>> the
>> Python FnAPI runner), I would prefer that the python bits be run 
>> in-process
>> but would have to shell out (maybe via docker, maybe something cheaper) 
>> for
>> the java bits. On the other hand, if I want to run this same pipeline
>> (ideally, the same model proto, such that we don't have
>> runner-dependent construction) on Flink, I might want the java bits to be
>> inlined and the Python bits to be in a separate process. On Dataflow, 
>> both
>> would live in containers. To do this, the Python runner would say "hey, I
>> know that Python environment" and just swap it out for in-process, and 
>> vice
>> versa. (For isolation/other reasons, one may want the option to force
>> everything to be docker, but that's more of a "don't make substitutions"
>> option than manually providing environment configs.)
>>
>> On the other hand, as we go the route of custom containers,
>> especially expansion services that might vend custom containers, I think 
>> we
>> need a way to push down *properties* of environments (such as resource
>> hints) through the expansion service that may influence the environments
>> that get attached and returned.
>>
>> It would be helpful for me to have concrete usecases of why a user
>> wants to customize the container used by some transform they did not 
>> write,
>> which could possibly inform the best course(s) of action here.
>>
>>
>>
>>>
>>>
>>> On Wed, Feb 3, 2021 at 5:07 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>


 On Wed, Feb 3, 2021 at 12:34 PM Kyle Weaver 
 wrote:

> Hi Beamers,
>
> Recently we’ve had some requests on user@ and Slack for
> instructions on how to use custom-built containers in cross-language
> pipelines (typically calling Java transforms from a predominantly 
> Python
> pipeline). Currently, it seems like there is no way to change the 
> container
> used by a cross-language transform except by modifying and rebuilding 

Re: Proposal: Scheduled tasks

2020-12-08 Thread Chad Dombrova
Thanks!

On Tue, Dec 8, 2020 at 6:54 AM Pablo Estrada  wrote:

> Hi Chad!
> I've been meaning to review this, I've just not carved up the time. I'll
> try to get back to you this week with some thoughts!
> Thanks!
> -P.
>
> On Wed, Dec 2, 2020 at 10:31 AM Chad Dombrova  wrote:
>
>> Hi everyone,
>> Beam's niche is low latency, high throughput workloads, but Beam has
>> incredible promise as an orchestrator of long running work that gets sent
>> to a scheduler.  We've created a modified version of Beam that allows the
>> python SDK worker to outsource tasks to a scheduler, like Kubernetes
>> batch jobs[1], Argo[2], or Google's own OpenCue[3].
>>
>> The basic idea is that any element in a stream can be tagged to be
>> executed outside of the normal SdkWorker as an atomic "task".  A task is
>> one invocation of a stage, composed of one or more DoFns, against one a
>> slice of the data stream, composed of one or more tagged elements.   The
>> upshot is that we're able to slice up the processing of a stream across
>> potentially *many* workers, with the trade-off being the added overhead
>> of starting up a worker process for each task.
>>
>> For more info on how we use our modified version of Beam to make visual
>> effects for feature films, check out the talk[4] I gave at the Beam Summit.
>>
>> Here's our design doc:
>>
>> https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing
>>
>> And here's the github branch:
>> https://github.com/LumaPictures/beam/tree/taskworker_public
>>
>> Looking forward to your feedback!
>> -chad
>>
>>
>> [1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
>> [2] https://argoproj.github.io/
>> [3] https://cloud.google.com/opencue
>> [4] https://www.youtube.com/watch?v=gvbQI3I03a8&ab_channel=ApacheBeam
>>
>>


Proposal: Scheduled tasks

2020-12-02 Thread Chad Dombrova
Hi everyone,
Beam's niche is low latency, high throughput workloads, but Beam has
incredible promise as an orchestrator of long running work that gets sent
to a scheduler.  We've created a modified version of Beam that allows the
python SDK worker to outsource tasks to a scheduler, like Kubernetes batch
jobs[1], Argo[2], or Google's own OpenCue[3].

The basic idea is that any element in a stream can be tagged to be executed
outside of the normal SdkWorker as an atomic "task".  A task is one
invocation of a stage, composed of one or more DoFns, against a slice
of the data stream, composed of one or more tagged elements.   The upshot
is that we're able to slice up the processing of a stream across
potentially *many* workers, with the trade-off being the added overhead of
starting up a worker process for each task.

For more info on how we use our modified version of Beam to make visual
effects for feature films, check out the talk[4] I gave at the Beam Summit.

Here's our design doc:
https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing

And here's the github branch:
https://github.com/LumaPictures/beam/tree/taskworker_public

Looking forward to your feedback!
-chad


[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
[2] https://argoproj.github.io/
[3] https://cloud.google.com/opencue
[4] https://www.youtube.com/watch?v=gvbQI3I03a8&ab_channel=ApacheBeam


Re: PTransform Annotations Proposal

2020-11-16 Thread Chad Dombrova
>
>
> Another example of an optional annotation is marking a transform to run on
> secure hardware, or to give hints to profiling/dynamic analysis tools.
>

There seems to be a lot of overlap between this idea and Environments.  Can
you talk about how you feel they may be different or related?  For example,
I could see annotations as a way of tagging transforms with an Environment,
or I could see Environments becoming a specialized form of annotation.

-chad


Re: Unable to run python formater (Are the instructions out of date?)

2020-11-04 Thread Chad Dombrova
ptions
>>   File
>> "/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py36-lint/lib/python3.6/site-packages/pip/_internal/cli/cmdoptions.py",
>> line 24, in 
>> from pip._internal.cli.progress_bars import BAR_TYPES
>>   File
>> "/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py36-lint/lib/python3.6/site-packages/pip/_internal/cli/progress_bars.py",
>> line 7, in 
>> from pip._vendor import six
>> ImportError: cannot import name 'six'
>>
>> 
>> log end
>> =
>> ERROR: could not install deps [-rbuild-requirements.txt]; v =
>> InvocationError('/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py36-lint/bin/python
>> target/.tox/py36-lint/bin/pip install --retries 10
>> -rbuild-requirements.txt', 1)
>> 
>> summary
>> _
>> ERROR:   py36-lint: could not install deps [-rbuild-requirements.txt]; v
>> =
>> InvocationError('/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py36-lint/bin/python
>> target/.tox/py36-lint/bin/pip install --retries 10
>> -rbuild-requirements.txt', 1)
>> ==
>>
>> tried pip install six,as well, but I am met with
>> Requirement already satisfied: six in
>> /usr/local/google/home/ajamato/.pyenv/versions/3.6.10/envs/my-virtual-env-3.6.10/lib/python3.6/site-packages
>> (1.15.0)
>>
>>
>>
>> I am guessing something is preventing tox from doing some steps? Does one
>> normally run tox under sudo?
>>
>>
>> On Wed, Nov 4, 2020 at 10:05 PM Chad Dombrova  wrote:
>>
>>>
>>> All of these are great suggestions. I think what I really need though is
>>>> some way to figure out how to cleanly install (perhaps reinstalling)
>>>> everything I need to run all these commands. tox, yapf,
>>>>
>>>
>>> tox should be the only thing you need to install.  After that, tox will
>>> install whatever you need to run the tests.  pre-commit accomplishes
>>> something similar, but just for the pre-commit git hooks.
>>>
>>> -chad
>>>
>>>


Re: Unable to run python formater (Are the instructions out of date?)

2020-11-04 Thread Chad Dombrova
> All of these are great suggestions. I think what I really need though is
> some way to figure out how to cleanly install (perhaps reinstalling)
> everything I need to run all these commands. tox, yapf,
>

tox should be the only thing you need to install.  After that, tox will
install whatever you need to run the tests.  pre-commit accomplishes
something similar, but just for the pre-commit git hooks.

-chad


Re: Unable to run python formater (Are the instructions out of date?)

2020-10-28 Thread Chad Dombrova
I would like to edit it!  I have an apache account and I am a committer, but
IIRC I could not edit it with my normal credentials.


On Wed, Oct 28, 2020 at 8:02 PM Robert Burke  wrote:

> (it's a wiki, so anyone who requests and account can improve it)
>
> On Wed, Oct 28, 2020, 7:45 PM Chad Dombrova  wrote:
>
>> It’s unfortunate that those instructions don’t include pre-commit, which
>> is by far the easiest way to do this.
>>
>> To set it up:
>>
>> pip install pre-commit
>> pre-commit install
>>
>> Install sets up git pre-commit hooks so that it will run yapf and pylint
>> on changed files every time you commit (you’ll need python3.7. I think it
>> should be possible to loosen this, as this has been an annoyance for me)
>>
>> To skip running the check on commit add -n:
>>
>> git commit -nm "blah blah"
>>
>> Alternatively, to run the check manually on changed files (pre-commit
>> install is not required to run it this way):
>>
>> pre-commit run yapf
>>
>> Or on all files:
>>
>> pre-commit run -a yapf
>>
>> More info here: https://pre-commit.com/#config-language_version
>>
>> On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:
>>
>>> I tried both the tox and yapf instructions on the python tips page
>>> <https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-Formatting>.
>>> And the gradle target which failed on PR precommit. I am wondering if there
>>> is something additional I need to setup?
>>>
>>> Here is the output from all three attempts approaches I attempted. Any
>>> ideas how to get this working?
>>>
>>> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
>>> --name-only --relative bigquery_python_sdk origin/master | xargs yapf
>>> --in-place*
>>> Traceback (most recent call last):
>>>   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
>>> 
>>> sys.exit(run_main())
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 365, in run_main
>>> sys.exit(main(sys.argv))
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 135, in main
>>> verbose=args.verbose)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 204, in FormatFiles
>>> in_place, print_diff, verify, quiet, verbose)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
>>> line 233, in _FormatFile
>>> logger=logging.warning)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
>>> line 100, in FormatFile
>>> verify=verify)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
>>> line 147, in FormatCode
>>> tree = pytree_utils.ParseCodeToTree(unformatted_source)
>>>   File
>>> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
>>> line 127, in ParseCodeToTree
>>> raise e
>>>   File "apache_beam/metrics/execution.pxd", line 18
>>> cimport cython
>>>  ^
>>> SyntaxError: invalid syntax
>>>
>>> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e
>>> py3-yapf*
>>> GLOB sdist-make: /usr/local/google/home/ajamato/beam/sdks/python/setup.py
>>> py3-yapf create:
>>> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
>>> ERROR: invocation failed (exit code 1), logfile:
>>> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
>>> ===
>>> log start
>>> 
>>> RuntimeError: failed to build image pkg_resources because:
>>> Traceback (most recent call last):
>>>   File
>>> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
>>> line 60, in _install
>>> installer.install(creator.interpreter.version_info)
>>>   File
>>> "

Re: Unable to run python formater (Are the instructions out of date?)

2020-10-28 Thread Chad Dombrova
It’s unfortunate that those instructions don’t include pre-commit, which is
by far the easiest way to do this.

To set it up:

pip install pre-commit
pre-commit install

Install sets up git pre-commit hooks so that it will run yapf and pylint on
changed files every time you commit (you’ll need python3.7. I think it
should be possible to loosen this, as this has been an annoyance for me)

To skip running the check on commit add -n:

git commit -nm "blah blah"

Alternatively, to run the check manually on changed files (pre-commit
install is not required to run it this way):

pre-commit run yapf

Or on all files:

pre-commit run -a yapf

More info here: https://pre-commit.com/#config-language_version

On Wed, Oct 28, 2020 at 6:46 PM Alex Amato  wrote:

> I tried both the tox and yapf instructions on the python tips page
> .
> And the gradle target which failed on PR precommit. I am wondering if there
> is something additional I need to setup?
>
> Here is the output from all three attempts approaches I attempted. Any
> ideas how to get this working?
>
> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ git diff
> --name-only --relative bigquery_python_sdk origin/master | xargs yapf
> --in-place*
> Traceback (most recent call last):
>   File "/usr/local/google/home/ajamato/.local/bin/yapf", line 8, in
> 
> sys.exit(run_main())
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 365, in run_main
> sys.exit(main(sys.argv))
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 135, in main
> verbose=args.verbose)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 204, in FormatFiles
> in_place, print_diff, verify, quiet, verbose)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/__init__.py",
> line 233, in _FormatFile
> logger=logging.warning)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
> line 100, in FormatFile
> verify=verify)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/yapf_api.py",
> line 147, in FormatCode
> tree = pytree_utils.ParseCodeToTree(unformatted_source)
>   File
> "/usr/local/google/home/ajamato/.local/lib/python2.7/site-packages/yapf/yapflib/pytree_utils.py",
> line 127, in ParseCodeToTree
> raise e
>   File "apache_beam/metrics/execution.pxd", line 18
> cimport cython
>  ^
> SyntaxError: invalid syntax
>
> *(ajamato_env2) ajamato@ajamato-linux0:~/beam/sdks/python$ tox -e py3-yapf*
> GLOB sdist-make: /usr/local/google/home/ajamato/beam/sdks/python/setup.py
> py3-yapf create:
> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf
> ERROR: invocation failed (exit code 1), logfile:
> /usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/log/py3-yapf-0.log
> ===
> log start
> 
> RuntimeError: failed to build image pkg_resources because:
> Traceback (most recent call last):
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py",
> line 60, in _install
> installer.install(creator.interpreter.version_info)
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/base.py",
> line 42, in install
> self._sync(filename, into)
>   File
> "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/pip_install/copy.py",
> line 13, in _sync
> copy(src, dst)
>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
> line 53, in copy
> method(norm(src), norm(dest))
>   File "/usr/lib/python3/dist-packages/virtualenv/util/path/_sync.py",
> line 64, in copytree
> shutil.copy(src_f, dest_f)
>   File "/usr/lib/python3.8/shutil.py", line 415, in copy
> copyfile(src, dst, follow_symlinks=follow_symlinks)
>   File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
> with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
> FileNotFoundError: [Errno 2] No such file or directory:
> '/usr/local/google/home/ajamato/beam/sdks/python/target/.tox/py3-yapf/lib/python3.8/site-packages/pkg_resources/_vendor/packaging/__init__.py'
>
>
> 
> log end
> =
> ERROR: InvocationError for command /usr/bin/python3 -m virtualenv
> --no-download --python /usr/bin/python3 py3-yapf (exited with code 1)
> 

Re: Modifying pip install behavior / custom pypi index

2020-09-11 Thread Chad Dombrova
Ok great.  Next question:

What is the relationship between sdks/python/container/boot.go and
Dataflow?  Is this file used within the Dataflow bootstrapping process?

We're currently investigating a switch from Flink to Dataflow, and in doing
so we hope to be able to work our way back to using stock Dataflow
containers wherever possible.  If we make this PR to add pip.conf support,
those changes will be largely made in boot.go, and we'd just like to
confirm that our updates will also make it into Dataflow, verbatim.

-chad



On Fri, Sep 11, 2020 at 2:24 PM Ahmet Altay  wrote:

>
>
> On Fri, Sep 11, 2020 at 2:11 PM Robert Bradshaw 
> wrote:
>
>> Hmm... this is a difficult question. I think adding support for a
>> pip.conf probably makes the most sense, despite it being yet another
>> option.
>>
>
> +1 - I think this is a good flag to add. I heard similar user requests for
> passing specific flags to pip before. Supporting a generic way with an
> optional flag would address those requests.
>
>
>>
>> Another alternative is to simply pre-install the dependencies you want
>> (or even just override /etc/pip.conf) in a custom container.
>>
>> On Wed, Sep 9, 2020 at 5:27 PM Chad Dombrova  wrote:
>>
>>> Hi all,
>>> We are running into problems trying to use our own pypi mirror with
>>> Beam. For those who are not well versed in the esotera of python package
>>> management, pip provides a few ways to specify urls for the pypi index
>>> server:
>>>
>>>- command line
>>>
>>> <https://pip.pypa.io/en/stable/reference/pip_install/#install-index-url>[1]:
>>>via --index-url
>>>- environment variables
>>><https://pip.pypa.io/en/stable/user_guide/#environment-variables>[2]:
>>>via PIP_INDEX_URL. In Beam, we don’t have any way to influence the
>>>environment of the boot process that runs pip install.
>>>- pip.conf <https://pip.pypa.io/en/stable/user_guide/#config-file>[3]:
>>>we could provide this as an artifact, but we don’t have any way of 
>>> placing
>>>it in the correct location (e.g. /etc/pip.conf) on the instance that
>>>runs pip install.
>>>- requirements.txt files can specify certain pip install flags
>>>
>>> <https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format>[4],
>>>such as --index-url. As such, passing a requirements file via
>>>--requirements_file would theoretically work, but we also want to be
>>>able to provide dev packages as wheels via --extra_package, which
>>>would be installed independently from the requirements file and thus use
>>>the default pypi index. We may be able to upload our wheel as an artifact
>>>and refer to it using a local path in the requirements file, but this
>>>solution seems a bit brittle as the local artifacts path is different for
>>>each job.
>>>
>>> Are there any known solutions to this problem? Here are some ideas:
>>>
>>>- add support for providing a pip.conf as a known artifact type
>>>(akin to --requirements_file).  this is by far the most powerful and
>>>straightforward solution, but do we have the stomach for yet another cli
>>>option?
>>>- add support for providing a destination path for artifacts, which
>>>would let us install it into /etc/pip.conf. I can see strong
>>>safety/security concerns around this.
>>>- provide a guarantee that the working directory for the boot
>>>process is inside the artifact directory: then we could refer to wheels
>>>inside our requirements file using relative paths.
>>>
>>> We're happy to make a pull request to add support for this feature, but
>>> it'd be great to have some input on the ideal solution before we begin.
>>>
>>> thanks!
>>> -chad
>>>
>>> [1]
>>> https://pip.pypa.io/en/stable/reference/pip_install/#install-index-url
>>> [2] https://pip.pypa.io/en/stable/user_guide/#environment-variables
>>> [3] https://pip.pypa.io/en/stable/user_guide/#config-file
>>> [4]
>>> https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format
>>>
>>> -chad
>>>
>>>


Modifying pip install behavior / custom pypi index

2020-09-09 Thread Chad Dombrova
Hi all,
We are running into problems trying to use our own pypi mirror with Beam.
For those who are not well versed in the esotera of python package
management, pip provides a few ways to specify urls for the pypi index
server:

   - command line [1]: via --index-url
   - environment variables [2]: via PIP_INDEX_URL. In Beam, we don’t have any
   way to influence the environment of the boot process that runs pip install.
   - pip.conf [3]: we could provide this as an artifact, but we don’t have any
   way of placing it in the correct location (e.g. /etc/pip.conf) on the
   instance that runs pip install.
   - requirements.txt files can specify certain pip install flags [4], such
   as --index-url. As such, passing a requirements file via
   --requirements_file would theoretically work, but we also want to be
   able to provide dev packages as wheels via --extra_package, which would
   be installed independently from the requirements file and thus use the
   default pypi index. We may be able to upload our wheel as an artifact and
   refer to it using a local path in the requirements file, but this solution
   seems a bit brittle as the local artifacts path is different for each job.

Are there any known solutions to this problem? Here are some ideas:

   - add support for providing a pip.conf as a known artifact type (akin to
   --requirements_file).  this is by far the most powerful and
   straightforward solution, but do we have the stomach for yet another cli
   option?
   - add support for providing a destination path for artifacts, which
   would let us install it into /etc/pip.conf. I can see strong
   safety/security concerns around this.
   - provide a guarantee that the working directory for the boot process is
   inside the artifact directory: then we could refer to wheels inside our
   requirements file using relative paths.

We're happy to make a pull request to add support for this feature, but
it'd be great to have some input on the ideal solution before we begin.

thanks!
-chad

[1] https://pip.pypa.io/en/stable/reference/pip_install/#install-index-url
[2] https://pip.pypa.io/en/stable/user_guide/#environment-variables
[3] https://pip.pypa.io/en/stable/user_guide/#config-file
[4]
https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format

-chad


Re: Python2.7 Beam End-of-Life Date

2020-06-18 Thread Chad Dombrova
>>>>
>>>>>>> >>>>>> On Thu, Jun 4, 2020 at 4:53 PM Valentyn Tymofieiev <
>>>>>>> valen...@google.com> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Back at the end of February we decided to revisit this
>>>>>>> conversation in 3 months. Do folks on this thread have any new input or
>>>>>>> perspective regarding us balancing "user pain/contributor pain/our 
>>>>>>> ability
>>>>>>> to continuously test with python 2 in a shifting environment"?
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Some new information on my end is that we have been seeing
>>>>>>> steady adoption of Python 3 among Beam Python users in Dataflow,
>>>>>>> particularly strong adoption among streaming users, and Dataflow is
>>>>>>> sunsetting Python 2 support for all released Beam SDKs later this year 
>>>>>>> [1].
>>>>>>> We will have to remove Python 2 Beam test suites that use Dataflow  when
>>>>>>> Dataflow runner disables Py2 support if this happens before Beam Py2 EOL
>>>>>>> (when we have to remove all Py2 suites), including performance tests 
>>>>>>> that
>>>>>>> still use Dataflow on Python 3.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> I am curious how much motivation there is in the community
>>>>>>> at this moment to continue Py2 support in Beam,  whether any previous 
>>>>>>> Py3
>>>>>>> migration blockers were resolved or any new blockers discovered among 
>>>>>>> Beam
>>>>>>> users.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> [1]
>>>>>>> https://cloud.google.com/python/docs/python2-sunset/#dataflow
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Fri, May 8, 2020 at 3:52 PM Valentyn Tymofieiev <
>>>>>>> valen...@google.com> wrote:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> That's good news! Thanks for sharing.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Another datapoint, here are a few of Beam's dependencies
>>>>>>> that no longer release new py2 artifacts (I looked at REQUIRED_PACKAGES 
>>>>>>> +
>>>>>>> aws, gcp, and interactive extras):
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> hdfs
>>>>>>> >>>>>>>> numpy
>>>>>>> >>>>>>>> pyarrow
>>>>>>> >>>>>>>> ipython
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> There are more if we include transitive dependencies and
>>>>>>> test-only packages. I also remember encountering one issue last month 
>>>>>>> that
>>>>>>> was broken only on Py2, which we had to go back and fix.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> If others have noticed frictions related to ongoing Py2
>>>>>>> support or have updates on previously mentioned Py3 migration blockers,
>>>>>>> feel free to post them.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> On Fri, May 8, 2020 at 9:19 AM Robert Bradshaw <
>>>>>>> rober...@google.com> wrote:
>>>>>>> >>>>>>>>>
>>>>>>> >>>>>>>>> It hasn't been 3 months yet, but I wanted to call out a
>>>>>>> milestone that
>>>>>>> >>>>>>>>> Python 3 downloads crossed the 50% threshold on pypi, if
>>>>>>> just briefly.
>>>>>>> >>>>>>>>>
>>>>>>> >>>>>>>>> On Thu, Feb 13, 2020 at 12:40 AM Ismaël Mejía <
>>>>>>> ieme...@gmail.com> wrote

Re: DataflowRunner | Cross-language

2020-06-08 Thread Chad Dombrova
> Even when running portably, Dataflow still has its own implementation of
> PubSubIO that is switched out for Python's "implementation." (It's actually
> built into the same layer that provides the shuffle/group-by-key
> implementation.) However, if you used the external Java PubSubIO it may not
> recognize this and continue to use that implementation even on dataflow.
>

That's great, actually, as we still have some headaches around using the
Java PubSubIO transform: it requires a custom build of the Java Beam API
and SDK container to add missing dependencies and properly deal with data
conversions from python<->java.

Next question: when using Dataflow+Portability can we specify our own
docker container for the Beam Python SDK when using the Docker executor?

We have two reasons to do this:
1) we have some environments that cannot be bootstrapped on top of the
stock Beam SDK image
2) we have a somewhat modified version of the Beam SDK (changes which we
eventually hope to contribute back, but won't be able to for at least a few
months).

If yes, what are the restrictions around custom SDK images?  e.g. must be
the same version of Beam, must be on a registry accessible to Dataflow,
etc...

thanks
-chad


Re: DataflowRunner | Cross-language

2020-06-08 Thread Chad Dombrova
Hi all,
quick followup question:


> small correction. While the new runner will be available with Beam 2.21,
>> the Cross-Language support will be available in 2.22.
>> There will be limitations in the initial set of connectors you can use
>> with Cross-Lang. But at least you will have something to test with,
>> starting in 2.22
>>
>
> To clarify, we're not actually prohibiting any other
> cross-langauge transforms being used, but Kafka is the only one that'll be
> extensively tested and supported at this time.
>

We're currently using the Flink runner with external Java PubSubIO
transforms in our python pipelines because there is no pure python option.
 In its non-portable past, Dataflow has had its own native implementation
of PubSubIO, that got switched out at runtime, so there was no need to use
external transforms there.  What's the story around PubSubIO when using
Dataflow + portability?  If we were to switch from Flink to Dataflow, would
we continue to use external Java PubSubIO transforms, or is there still
some special treatment of pubsub for Portable Dataflow?

-chad


Re: Install Jenkins AnsiColor plugin

2020-03-08 Thread Chad Dombrova
I don’t believe that it was ever resolved.  I have a PR with a bunch of
attempts to get it working but I never did figure it out.  IIRC there did
seem to be some ansi plugin already installed but I couldn’t get it to
work.

-chad


On Sun, Mar 8, 2020 at 10:52 AM Ismaël Mejía  wrote:

> Did this ever happen? If not what is blocking it?
>
>
>
> On Tue, Oct 22, 2019 at 10:13 PM Udi Meiri  wrote:
> >
> > Your proposal will only affect the seed job (which doesn't do color
> outputs AFAIK).
> > I think you want to add colorizeOutput() here:
> >
> https://github.com/apache/beam/blob/bfebbd0d16361f61fa40bfdec2f0cb6f943f7c9a/.test-infra/jenkins/CommonJobProperties.groovy#L79-L95
> >
> > Otherwise no concerns from me.
> >
> > On Tue, Oct 22, 2019 at 12:01 PM Chad Dombrova 
> wrote:
> >>
> >> thanks, so IIUC, I’m going to update job_00_seed.groovy like this:
> >>
> >>   wrappers {
> >> colorizeOutput()
> >> timeout {
> >>   absolute(60)
> >>   abortBuild()
> >> }
> >>   }
> >>
> >> Then add the comment run seed job
> >>
> >> Does anyone have any concerns with me trying this out now?
> >>
> >> -chad
> >>
> >>
> >> On Tue, Oct 22, 2019 at 11:42 AM Udi Meiri  wrote:
> >>>
> >>> Also note that changing the job DSL doesn't take effect until the
> "seed" job runs. (use the "run seed job" phrase)
> >>>
> >>> On Tue, Oct 22, 2019 at 11:06 AM Chad Dombrova 
> wrote:
> >>>>
> >>>> Thanks, I'll look into this.  I have a PR I'm building up with a
> handful of minor changes related to this.
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Oct 22, 2019 at 10:45 AM Yifan Zou 
> wrote:
> >>>>>
> >>>>> Thanks, Udi! The ansicolor plugin was applied to ASF Jenkins
> universally. You might need to explicitly enable the coloroutput in your
> jenkins dsl.
> >>>>>
> >>>>> On Tue, Oct 22, 2019 at 10:33 AM Udi Meiri  wrote:
> >>>>>>
> >>>>>> Seems to be already installed:
> https://issues.apache.org/jira/browse/INFRA-16944
> >>>>>> Do we just need to enable it somehow?
> >>>>>> This might work:
> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.helpers.wrapper.WrapperContext.colorizeOutput
> >>>>>>
> >>>>>> BTW, our Jenkins is maintained by ASF's Infrastructure team:
> https://cwiki.apache.org/confluence/display/INFRA/Jenkins
> >>>>>>
> >>>>>> On Tue, Oct 22, 2019 at 10:23 AM Chad Dombrova 
> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>> As a user trying to grok failures in jenkins I think it would be a
> huge help to have color output support.  This is something that works out
> of the box for CI tools like gitlab and travis, and it really helps bring
> that 21st century feel to your logs :)
> >>>>>>>
> >>>>>>> There's a Jenkins plugin for colorizing ansi escape sequences here:
> >>>>>>> https://plugins.jenkins.io/ansicolor
> >>>>>>>
> >>>>>>> I think this is something that has to be deployed by a Jenkins
> admin.
> >>>>>>>
> >>>>>>> -chad
> >>>>>>>
>


Re: Python Static Typing: Next Steps

2020-03-03 Thread Chad Dombrova
>
> This probably does not apply yet, does optional mean that opting-in for
> all or none of the precommit hooks? Would contributors have a choice of
> which pre-commits they can opt-in too?
>

Yes, once the hook is installed, individual checks are opt-out and the
design is clearly built around a one-time override, rather than setting up
a long-term preference: https://pre-commit.com/#temporarily-disabling-hooks

There might be a better way that's not documented.


Re: Python Static Typing: Next Steps

2020-03-03 Thread Chad Dombrova
>
>
>> +1. Also would not running dmypy would require all contributors to run a
> process all the time? I do not know if this is desired by existing
> contributors, and I am not sure if it will be friendly to new contributors.
>

Pre-commit git hooks are completely opt-in.   Developers who choose not to
use them will simply continue doing what they do now.  The Jenkins jobs
remain the only gatekeeper that really matters.  But as a developer I'd
rather get all my linting and mypy checks done in a second each time I
commit, so that I know that the Jenkins lint job will succeed when I push
up my changes.  Plus there's a lot of overhead running the lint jobs via
tox.

Also note that even after you've setup the pre-commit hook in git, you can
opt out for a specific commit by passing -n to git commit.


Re: Python Static Typing: Next Steps

2020-03-03 Thread Chad Dombrova
I tested out dmypy (the mypy daemon) last night and it was completing in
under a second after editing a file and rerunning (usually around 0.6s),
which puts it into pre-commit-hook territory.

-chad




On Tue, Mar 3, 2020 at 1:54 AM Ismaël Mejía  wrote:

> +1 to do it by default. Great to see the typing work arrive to this
> maturity milestone.
> We can also refer to some of the mypy typing docs for newbies on the
> subject.
>
> On Tue, Mar 3, 2020 at 10:15 AM Kamil Wasilewski
>  wrote:
> >
> > +1 for enabling mypy as a precommit job
> >
> > This however could be a good occasion to rework the current PythonLint
> job. Since yapf has been introduced, some of the checks made by
> pylint/flake are now unnecessary and could be dismantled. This would
> speed-up PythonLint quite a lot.
> > I volunteer to help with anything as well.
> >
> > On Tue, Mar 3, 2020 at 1:43 AM Robert Bradshaw 
> wrote:
> >>
> >> It seems people are conflating git pre-commit hooks (which IMHO should
> >> ideally be in the sub-second range, and run when an author does "git
> >> commit") with jenkins pre-commit tests (for which minutes is nothing
> >> compared to what we already do). I am +1 to adding mypy to the latter
> >> for sure, and think we should probably hold off for the former.
> >>
> >> On Mon, Mar 2, 2020 at 4:38 PM Udi Meiri  wrote:
> >> >
> >> > Off-topic: Python lint via pre-commit should be much faster. (I wrote
> my own modified-file-only lint in the past)
> >> >
> >> > On Mon, Mar 2, 2020 at 2:08 PM Kyle Weaver 
> wrote:
> >> >>
> >> >> > Python lint takes 4-5mins to complete. I think if the mypy
> analysis is really on the order of 10s, the additional time won't matter
> and could always be enabled.
> >> >>
> >> >> +1 of course it would be nice to make mypy as fast as possible, but
> I don't think speed needs to be a blocker. The productivity gains we'd get
> from reliable type analysis more than offset the cost IMO.
> >> >>
> >> >> On Mon, Mar 2, 2020 at 2:03 PM Luke Cwik  wrote:
> >> >>>
> >> >>> Python lint takes 4-5mins to complete. I think if the mypy analysis
> is really on the order of 10s, the additional time won't matter and could
> always be enabled.
> >> >>>
> >> >>> On Mon, Mar 2, 2020 at 1:21 PM Chad Dombrova 
> wrote:
> >> >>>>>
> >> >>>>> I believe that mypy via pre-commit hook will be faster than 10s
> since it only applies to modified files.
> >> >>>>
> >> >>>>
> >> >>>> Correct, with a few caveats:
> >> >>>>
> >> >>>> pre-commit can be setup to only run if a python file changes.  so
> modifying a java file won't trigger mypy to run.
> >> >>>> if *any* python file changes mypy has to run on the whole
> codebase, because a change to one file can affect the others (i.e. a
> function arg type changes).  it's not really meaningful to run mypy on a
> single file.
> >> >>>> the mypy daemon tracks which files have changed, and runs
> incremental updates.  so if we setup the precommit hook to run the daemon,
> we should see that get appreciably faster.  I'll do some tests and report
> back.
>


Re: Python Static Typing: Next Steps

2020-03-02 Thread Chad Dombrova
>
> I believe that mypy via pre-commit hook will be faster than 10s since it
> only applies to modified files.
>

Correct, with a few caveats:

   - pre-commit can be setup to only run if a python file changes.  so
   modifying a java file won't trigger mypy to run.
   - if *any* python file changes mypy has to run on the whole codebase,
   because a change to one file can affect the others (i.e. a function arg
   type changes; see the small sketch below).  it's not really meaningful to
   run mypy on a single file.
   - the mypy daemon tracks which files have changed, and runs incremental
   updates.  so if we setup the precommit hook to run the daemon, we should
   see that get appreciably faster.  I'll do some tests and report back.
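
To illustrate the second point with a contrived two-file sketch (not Beam
code): if only scale.py is edited, the resulting error lives in the untouched
caller, so checking just the changed file would miss it.

# scale.py -- the only file that was edited; the argument type narrowed to int
def scale(x: int) -> int:
    return x * 2

# caller.py -- unchanged, but it now contains the error mypy needs to re-check
from scale import scale

print(scale(3.5))  # mypy flags this call: float is not compatible with the new int parameter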


Python Static Typing: Next Steps

2020-03-02 Thread Chad Dombrova
Good news everyone!
We nearly have the full beam codebase passing in mypy.

As we are now approaching the zero-error event horizon, I'd like to open up
a discussion around enabling mypy in the PythonLint job.  Every day or so a
PR is merged that introduces some new mypy errors, so enabling this test is
the only way I see to keep the annotations accurate and thus useful.

Developer fatigue is a real concern here, since static typing has a steep
learning curve, and there are still not a lot of experts to help consult on
PRs.  Here are some things that I hope will mitigate those concerns:

   - We have a lot of typing coverage, so that means plenty of examples of
   how to solve different types of problems
   - Running mypy only takes 10 seconds to complete (if you execute it
   outside of gradle / tox), and that will get better when we get to 0
   errors.  Also, running mypy in daemon mode should speed that up even more
   - I have a PR[1] to allow developers to easily (and optionally) setup
   yapf to run in a local git pre-commit hook;  I'd like to do the same for
   mypy.
   - I will make myself and members of my team available to help out with
   typing questions in PRs

Is there anyone else on the list who is knowledgable about python static
typing who would like to volunteer to be flagged on typing questions?

What else can we do to make this transition easier?

[1] https://github.com/apache/beam/pull/10810

-chad


Re: [ANNOUNCE] New committer: Chad Dombrova

2020-02-24 Thread Chad Dombrova
Thanks, folks!  I'm very excited to "retest this" :)

Especially big thanks to Robert and Udi for all their hard work reviewing
my PRs.

-chad


On Mon, Feb 24, 2020 at 1:44 PM Brian Hulette  wrote:

> Congratulations Chad! Thanks for all your contributions :)
>
> On Mon, Feb 24, 2020 at 1:43 PM Kyle Weaver  wrote:
>
>> Well-deserved, thanks for your dedication to the project Chad. :)
>>
>> On Mon, Feb 24, 2020 at 1:34 PM Udi Meiri  wrote:
>>
>>> Congrats and welcome, Chad!
>>>
>>> On Mon, Feb 24, 2020 at 1:21 PM Pablo Estrada 
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Please join me and the rest of the Beam PMC in welcoming a new
>>>> committer: Chad Dombrova
>>>>
>>>> Chad has contributed to the project in multiple ways, including
>>>> improvements to the testing infrastructure, and adding type annotations
>>>> throughout the Python SDK, as well as working closely with the community on
>>>> these improvements.
>>>>
>>>> In consideration of his contributions, the Beam PMC trusts him with the
>>>> responsibilities of a Beam Committer[1].
>>>>
>>>> Thanks Chad for your contributions!
>>>>
>>>> -Pablo, on behalf of the Apache Beam PMC.
>>>>
>>>> [1]
>>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>>
>>>>
>>>


Re: Cross-language pipelines status

2020-02-19 Thread Chad Dombrova
The Java deps are only half of the problem. The other half is that PubsubIO
and KafkaIO are using classes that do not have a python equivalent and thus
no universal coder.  The solution discussed in the issue I linked above was
to use row coder registries in Java, to convert from these types to rows /
schemas.

Any thoughts on that?

-chad


On Wed, Feb 19, 2020 at 6:00 PM Robert Bradshaw  wrote:

> Hopefully this should be resovled by
> https://issues.apache.org/jira/browse/BEAM-9229
>
> On Wed, Feb 19, 2020 at 5:52 PM Chad Dombrova  wrote:
> >
> > We are using external transforms to get access to PubSubIO within
> python.  It works well, but there is one major issue remaining to fix:  we
> have to build a custom beam with a hack to add the PubSubIO java deps and
> fix up the coders.  This affects KafkaIO as well.  There's an issue here:
> https://issues.apache.org/jira/browse/BEAM-7870
> >
> > I consider this to be the most pressing problem with external transforms
> right now.
> >
> > -chad
> >
> >
> >
> > On Wed, Feb 12, 2020 at 9:28 AM Chamikara Jayalath 
> wrote:
> >>
> >>
> >>
> >> On Wed, Feb 12, 2020 at 8:10 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>>
> >>>
> >>>> AFAIK, there's no official guide for cross-language pipelines. But
> there are examples and test cases you can use as reference such as:
> >>>>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
> >>>>
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
> >>>>
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
> >>>>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py
> >>>
> >>>
> >>> I'm trying to work with tech writers to add more documentation related
> to cross-language (in a few months). But any help related to documenting
> what we have now is greatly appreciated.
> >>>
> >>>
> >>> That would be great since now the information is a bit scattered over
> different places. I’d be happy to help with any examples and their testing
> that I hope I’ll have after a while.
> >>
> >>
> >> Great.
> >>
> >>>
> >>>
> >>>> The runner and SDK supports are in working state I could say but not
> many IOs expose their cross-language interface yet (you can easily write
> cross-language configuration for any Python transforms by yourself though).
> >>>
> >>>
> >>> Should mention here the test suites for portable Flink and Spark
> Heejong added recently :)
> >>>
> >>>
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/
> >>>
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/
> >>>
> >>>
> >>> Nice! Looks like my question above about cross-language support in
> Spark runner was redundant.
> >>>
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>> - Is the information here
> https://beam.apache.org/roadmap/connectors-multi-sdk/ up-to-date? Are
> there any other entry points you can recommend?
> >>>>
> >>>>
> >>>> I think it's up-to-date.
> >>>
> >>>
> >>> Mostly up to date.  Testing status is more complete now and we are
> actively working on getting the dependencies story correct and adding
> support for DataflowRunner.
> >>>
> >>>
> >>> Are there any “umbrella" Jiras regarding cross-language support that I
> can track?
> >>
> >>
> >> I don't think we have an umbrella JIRA currently. I can create one and
> mention it in the roadmap.
>


Re: Cross-language pipelines status

2020-02-19 Thread Chad Dombrova
We are using external transforms to get access to PubSubIO within python.
It works well, but there is one major issue remaining to fix:  we have to
build a custom beam with a hack to add the PubSubIO java deps and fix up
the coders.  This affects KafkaIO as well.  There's an issue here:
https://issues.apache.org/jira/browse/BEAM-7870

I consider this to be the most pressing problem with external transforms
right now.

-chad



On Wed, Feb 12, 2020 at 9:28 AM Chamikara Jayalath 
wrote:

>
>
> On Wed, Feb 12, 2020 at 8:10 AM Alexey Romanenko 
> wrote:
>
>>
>> AFAIK, there's no official guide for cross-language pipelines. But there
>>> are examples and test cases you can use as reference such as:
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
>>>
>>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
>>>
>>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py
>>>
>>
>> I'm trying to work with tech writers to add more documentation related to
>> cross-language (in a few months). But any help related to documenting what
>> we have now is greatly appreciated.
>>
>>
>> That would be great since now the information is a bit scattered over
>> different places. I’d be happy to help with any examples and their testing
>> that I hope I’ll have after a while.
>>
>
> Great.
>
>
>>
>> The runner and SDK supports are in working state I could say but not many
>>> IOs expose their cross-language interface yet (you can easily write
>>> cross-language configuration for any Python transforms by yourself though).
>>>
>>
>> Should mention here the test suites for portable Flink and Spark Heejong
>> added recently :)
>>
>>
>> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/
>>
>> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/
>>
>>
>> Nice! Looks like my question above about cross-language support in Spark
>> runner was redundant.
>>
>>
>>
>>>
>>>
 - Is the information here
 https://beam.apache.org/roadmap/connectors-multi-sdk/ up-to-date? Are
 there any other entry points you can recommend?

>>>
>>> I think it's up-to-date.
>>>
>>
>> Mostly up to date.  Testing status is more complete now and we are
>> actively working on getting the dependencies story correct and adding
>> support for DataflowRunner.
>>
>>
>> Are there any “umbrella" Jiras regarding cross-language support that I
>> can track?
>>
>
> I don't think we have an umbrella JIRA currently. I can create one and
> mention it in the roadmap.
>


Re: [DISCUSS] Autoformat python code with Black

2020-02-07 Thread Chad Dombrova
>>> Thanks! Now we get to debate what knobs to twiddle :-P
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>> FYI, I did a simple run (just pushed to
> >>>>>>> >> > >>>>
> https://github.com/apache/beam/compare/master...robertwb:yapf) to see
> >>>>>>> >> > >>>> the impact. The diff is
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>> $ git diff --stat master
> >>>>>>> >> > >>>> ...
> >>>>>>> >> > >>>>  547 files changed, 22118 insertions(+), 21129
> deletions(-)
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>> For reference
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>> $ find sdks/python/apache_beam -name '*.py' | xargs
> wc
> >>>>>>> >> > >>>> ...
> >>>>>>> >> > >>>> 200424  612002 7431637 total
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>> which means a little over 10% of lines get touched. I
> think there are
> >>>>>>> >> > >>>> some options, such as
> SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES and
> >>>>>>> >> > >>>> COALESCE_BRACKETS, that will conform more to the style
> we are already
> >>>>>>> >> > >>>> (mostly) following.
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>>
> >>>>>>> >> > >>>> On Thu, Jan 23, 2020 at 1:59 AM Kamil Wasilewski
> >>>>>>> >> > >>>>  wrote:
> >>>>>>> >> > >>>> >
> >>>>>>> >> > >>>> > Thank you Michał for creating the ticket. I have some
> free time and I'd like to volunteer myself for this task.
> >>>>>>> >> > >>>> > Indeed, it looks like there's consensus for `yapf`, so
> I'll try `yapf` first.
> >>>>>>> >> > >>>> >
> >>>>>>> >> > >>>> > Best,
> >>>>>>> >> > >>>> > Kamil
> >>>>>>> >> > >>>> >
> >>>>>>> >> > >>>> >
> >>>>>>> >> > >>>> > On Thu, Jan 23, 2020 at 10:37 AM Michał Walenia <
> michal.wale...@polidea.com> wrote:
> >>>>>>> >> > >>>> >>
> >>>>>>> >> > >>>> >> Hi all,
> >>>>>>> >> > >>>> >> I created a JIRA issue for this and summarized the
> available tools
> >>>>>>> >> > >>>> >>
> >>>>>>> >> > >>>> >> https://issues.apache.org/jira/browse/BEAM-9175
> >>>>>>> >> > >>>> >>
> >>>>>>> >> > >>>> >> Cheers,
> >>>>>>> >> > >>>> >> Michal
> >>>>>>> >> > >>>> >>
> >>>>>>> >> > >>>> >> On Thu, Jan 23, 2020 at 1:49 AM Udi Meiri <
> eh...@google.com> wrote:
> >>>>>>> >> > >>>> >>>
> >>>>>>> >> > >>>> >>> Sorry, backing off on this due to time constraints.
> >>>>>>> >> > >>>> >>>
> >>>>>>> >> > >>>> >>> On Wed, Jan 22, 2020 at 3:39 PM Udi Meiri <
> eh...@google.com> wrote:
> >>>>>>> >> > >>>> >>>>
> >>>>>>> >> > >>>> >>>> It sounds like there's a consensus for yapf. I
> volunteer to take this on
> >>>>>>> >> > >>>> >>>>
> >>>>>>> >> > >>>> >>>> On Wed, Jan 22, 

Re: Python2.7 Beam End-of-Life Date

2020-02-04 Thread Chad Dombrova
>
>
>  Not to mention that all the nice work for the type hints will have to be
> redone in the for 3.x.
>

Note that there's a tool for automatically converting type comments to
annotations: https://github.com/ilevkivskyi/com2ann

So don't let that part bother you.
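
To give a sense of what it does, here is roughly the kind of rewrite
com2ann performs (a made-up example, not Beam code):

from typing import Dict, List

# before: python 2-compatible type comments
counts = {}  # type: Dict[str, int]

def split_words(line):
    # type: (str) -> List[str]
    return line.split()

# after running com2ann: native annotations
counts: Dict[str, int] = {}

def split_words(line: str) -> List[str]:
    return line.split()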

I'm curious what other features you'd like to be using in the Beam source
that you cannot now.

It seems the faster we drop support the better.
>

I've already gone over my position on this, but a refresher for those who
care:  some of the key vendors that support my industry will not offer
python3-compatible versions of their software until the 4th quarter of
2020.  If Beam switches to python3-only before that point we may be forced
to stop contributing features (note: I'm the guy who added the type hints
:).   Every month you can give us would be greatly appreciated.

-chad


Re: [DISCUSSION] Improve release notes by adding a change list file

2020-02-01 Thread Chad Dombrova
In case it's of any use, there's a tool called towncrier[1] to help compile
changelog fragments and compile them at time of delivery.

I came across this when working on the python-attrs[2] project, which has
some good documentation for contributors on how to use it:
https://www.attrs.org/en/stable/contributing.html#changelog



[1] https://github.com/hawkowl/towncrier
[2] https://github.com/python-attrs/attrs
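
The workflow is roughly: each PR adds a small fragment file instead of
editing a shared changelog, and the release manager compiles them at
release time.  For illustration (the file name, version, and fragment
directory are made up; they're configured per project):

# changelog.d/9915.feature.rst
#   "Added static type annotations to the Python SDK."
# at release time, compile the fragments into the changelog and remove them:
towncrier --version 2.20.0 --yes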


On Fri, Jan 31, 2020 at 5:09 PM Ahmet Altay  wrote:

> Thank you for the quick responses. I sent out
> https://github.com/apache/beam/pull/10743 to make this change. Please
> provide feedback or directly edit the PR.
>
> On Fri, Jan 31, 2020 at 3:58 PM Robert Bradshaw 
> wrote:
>
>> Yes, yes, yes! This is the one model of release notes that I've
>> actually seen work well at scale.
>>
>>
>> https://lists.apache.org/thread.html/41e03ace17dbcccf7e267ba6d538736b2a99a8e73e7fb45702766b17%40%3Cdev.beam.apache.org%3E
>>
>> Let's make it happen.
>>
>> On Fri, Jan 31, 2020 at 3:47 PM Robert Burke  wrote:
>> >
>> > I like this suggestion, Jira titles and commit summaries don't
>> necessarily reflect the user impact for a given change (or set of changes).
>> Being able to see the Forest instead of the trees.
>> >
>> > On Fri, Jan 31, 2020, 3:37 PM Kenneth Knowles  wrote:
>> >>
>> >> +1
>> >>
>> >> This is a great idea. Hope it can lead to higher-value view of
>> relevant changes.
>> >>
>> >> I like it being in the root of the repo, so it lives next to the code.
>> >>
>> >> Since the website is also markdown, it could be copied over directly
>> at release time, so it can be browsed there, too.
>> >>
>> >> Kenn
>> >>
>> >> On Fri, Jan 31, 2020 at 3:16 PM Ahmet Altay  wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> We currently have two major ways to communicate changes in a release:
>> >>> - A blog post, to highlight major changes in the release. (Example
>> for 2.17: [1])
>> >>> - JIRA release notes pages listing all issues tagged for a specific
>> release. (Example for 2.17 [2]).
>> >>>
>> >>> There are a few issues with this process:
>> >>> - It is difficult for the release manager to know what is important,
>> what is a breaking change, what is dependency change etc. For example,
>> there were more than 150 Jira issues tagged for 2.17 release.
>> >>> - Release blog has many items, and does not necessarily communicate
>> important changes. It is difficult for users to discover major changes
>> short of going through a large list.
>> >>> - People involved in authoring or reviewing a PRs usually have the
>> most context about the change, and they are not necessarily involved in the
>> release process to provide this additional information.
>> >>>
>> >>> Would it be helpful if we maintain a simple change list file and
>> update it as part of the PRs with noteworthy changes? Release managers
>> could use this information as is in their blog posts (or link to it). Users
>> will have a single place to find highlights from various versions.
>> >>>
>> >>> Concretely, I am proposing:
>> >>> - Adding a CHANGES file to the root of the repository. (Name could be
>> anything, TFX uses RELEASE.md in their repo. [3])
>> >>> - Ask PR authors to update this file as part of their PR whenever it
>> makes sense
>> >>> - Reference this file during the release process, and add a new section
>> >>> for the next release after each release.
>> >>>
>> >>> Ahmet
>> >>>
>> >>> [1] https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
>> >>> [2]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345970=12319527
>> >>> [3] https://github.com/tensorflow/tfx/blob/master/RELEASE.md
>>
>


Re: [Discuss] Beam Summit 2020 Dates & locations

2020-01-22 Thread Chad Dombrova
Hi all,
Did we come to a consensus on dates and locations for the summits?
Particularly interested in the North America Summit.

Thanks,
-chad


On Tue, Nov 26, 2019 at 7:26 AM Alexey Romanenko 
wrote:

> Probably, it would make sense to wait a bit for October (or September)
> dates since the dates of ApacheCon 2020 have not been officially set and
> announced yet, afaik.
>
>
> On 26 Nov 2019, at 03:55, Ahmet Altay  wrote:
>
>
>
> On Thu, Nov 21, 2019 at 2:49 PM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Maria put together this document with related industry conferences [1],
>> it would make sense to choose a time that doesn't conflict with other
>> events around projects close to Beam.
>>
>> How about for June 21-22 (around Spark Summit) for North America, and
>> October 5-6 or October 12-13 for Europe?
>>
>
> Sounds good to me. IMO, October 12-13 for Europe is better, in between
> related flink and spark summits.
>
>
>>
>> [1]
>> https://docs.google.com/spreadsheets/d/1LQv95XP9UPhZjGqcMA8JksfcAure-fAp4h3TTjaafRs/edit#gid=1680445982
>>
>> On Tue, Nov 12, 2019 at 4:45 PM Udi Meiri  wrote:
>>
>>> +1 for better organization. I would have gone to ApacheCon LV had I
>>> known there was going to be a Beam summit there.
>>>
>>> On Tue, Nov 12, 2019 at 9:31 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 On 8 Nov 2019, at 11:32, Maximilian Michels  wrote:
 >
 > The dates sounds good to me. I agree that the bay area has an
 advantage because of its large tech community. On the other hand, it is a
 question of how we run the event. For Berlin we managed to get about 200
 attendees to Berlin, but for the BeamSummit in Las Vegas with ApacheCon the
 attendance was much lower.

 I agree with your point Max and I believe that it would be more
 efficient to run Beam Summit as a “standalone" event (as it was done in
 London and Berlin) which will allow us to attract mostly
 Beam-oriented/interested/focused audience compared to running this as part
 of ApacheCon or any other large conferences where there are many other different
 topics and tracks.

 > Should this also be discussed on the user mailing list?

 Definitively! Despite the fact that users opinion is a key point here,
 it will not be so easy to get not-biased statistics in this question.

 The time frames are also very important since holidays in different
 countries (for example, August is traditionally a "vacation month" in
 France and some other European countries) can affect people's availability
 and influence the final number of participants in the end.

 >
 > Cheers,
 > Max
 >
 > On 07.11.19 22:50, Alex Van Boxel wrote:
 >> For date wise, I'm wondering why we should switching the Europe and
 NA one, this would mean that the Berlin and the new EU summit would be
 almost 1.5 years apart.
 >>  _/
 >> _/ Alex Van Boxel
 >> On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay >>> > wrote:
 >>I prefer bay are for NA summit. My reasoning is that there is a
 >>criticall mass of contributors and users in that location,
 probably
 >>more than alternative NA locations. I was not involved with
 planning
 >>recently and I do not know if there were people who could attend
 due
 >>to location previously. If that is the case, I agree with Elliotte
 >>on looking for other options.
 >>Related to dates: March (Asia) and mid-May (NA) dates are a bit
 >>close. Mid-June for NA might be better to spread events. Other
 >>pieces looks good.
 >>Ahmet
 >>On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold
 >>mailto:elh...@ibiblio.org>> wrote:
 >>The U.S. sadly is not a reliable destination for international
 >>conferences these days. Almost every conference I go to, big
 and
 >>small, has at least one speaker, sometimes more, who can't
 get into
 >>the country. Canada seems worth considering. Vancouver,
 >>Montreal, and
 >>Toronto are all convenient.
 >>On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas <
 g...@apache.org
 >>> wrote:
 >> >
 >> > Hi Beam Community!
 >> >
 >> > I'd like to kick off a thread to discuss potential dates
 and
 >>venues for the 2020 Beam Summits.
 >> >
 >> > I did some research on industry conferences happening in
 2020
 >>and pre-selected a few ranges as follows:
 >> >
 >> > (2 days) NA between mid-May and mid-June
 >> > (2 days) EU mid October
 >> > (1 day) Asia Mini Summit:  March
 >> >
 >> > I'd like to hear your thoughts on these dates and get
 >>

Re: [DISCUSS] Autoformat python code with Black

2020-01-21 Thread Chad Dombrova
>
>
> It'd be good if there was a way to only apply to violating (or at
> least changed) lines.


I assumed the first thing we’d do is convert all of the code in one go,
since it’s a very safe operation. Did you have something else in mind?

-chad





>
> On Tue, Jan 21, 2020 at 1:56 PM Chad Dombrova  wrote:
> >
> > +1 to autoformatting
> >
> > Let me add some nuance to that.
> >
> > The way I see it there are 2 varieties of formatters:  those which take
> the original formatting into consideration (autopep8) and those which
> disregard it (yapf, black).
> >
> > I much prefer yapf to black, because you have plenty of options to tweak
> with yapf (enough to make the output a pretty close match to the current
> Beam style), and you can mark areas to preserve the original formatting,
> which could be very useful with Pipeline building with pipe operators.
> Please don't pick black.
> >
> > autopep8 is more along the lines of spotless in Java -- it only corrects
> code that breaks the project's style rules.  The big problem with Beam's
> current style is that it is so esoteric that autopep8 can't enforce it --
> and I'm not just talking about 2-spaces, which I don't really have a
> problem with -- the problem is the use of either 2 or 4 spaces depending on
> context (expression start vs hanging indent, etc).  This is my *biggest*
> gripe about the current style.  PyCharm doesn't have enough control
> either.  So, if we can choose a style that can be expressed by flake8 or
> pycodestyle then we can use autopep8 to enforce it.
> >
> > I'd prefer autopep8 to yapf because I like having a little wiggle room
> to influence the style, but on a big project like Beam all that wiggle room
> ends up to minor but noticeable inconsistencies in style throughout the
> project.  yapf ensures completely consistent style, but the tradeoff is
> that it's sometimes ugly, especially in scenarios with similar repeated
> entries like argparse, where yapf might insert line breaks in visually
> inconsistent and unappealing ways depending on the lengths of the keywords
> and expressions involved.
> >
> > Either way (but especially if we choose yapf) I think it'd be a nice
> addition to setup a pre-commit [1] config so that people can opt in to
> running *lightweight* autofixers prior to commit.  This will not only
> reduce dev frustration but will also reduce the amount of cpu cycles that
> Jenkins spends pointing out lint errors.
> >
> > [1] https://pre-commit.com/
> >
> > -chad
> >
> >
> >
> >
> > On Tue, Jan 21, 2020 at 12:52 PM Ismaël Mejía  wrote:
> >>
> >> Last time we discussed this there seems not to be much progress into
> autoformatting.
> >> This tool looks more tweakable, so maybe it could be more appropriate
> for Beam's use case.
> >> https://github.com/google/yapf/
> >> WDYT?
> >>
> >>
> >> On Thu, May 30, 2019 at 10:50 AM Łukasz Gajowy 
> wrote:
> >>>
> >>> +1 for any autoformatter for Python SDK that does the job. My
> experience is that since spotless in Java SDK I would never start a new
> Java project without it. So many great benefits not only for one person
> coding but for all community.
> >>>
> >>> It is a GitHub UI issue that you cannot easily browse past the
> reformat. It is not actually that hard, but does take a couple extra clicks
> to get GitHub to display blame before a reformat. It is easier with the
> command line. I do a lot of code history digging and the global Java
> reformat is not really a problem.
> >>>
> >>> It's actually one more click on Github but I agree it's not the best
> way to search the history. The most convenient and clear one I've found so
> far is in Jetbrains IDEs (Intelij) where you can:
> >>>
> >>> right click on line number -> "annotate" -> click again -> "annotate
> previous revision" -> ...
> >>>
> >>> You can also use "compare with" to see the diff between two revisions.
> >>>
> >>> Łukasz
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> czw., 30 maj 2019 o 06:15 Kenneth Knowles 
> napisał(a):
> >>>>
> >>>> +1 pending good enough tooling (I can't quite tell - seems there are
> some issues?)
> >>>>
> >>>> On Wed, May 29, 2019 at 2:40 PM Katarzyna Kucharczyk <
> ka.kucharc...@gmail.com> wrote:
> >>>>>
> >>>>> What else actually we gain? My guess is faster PR review iteration.
> We will 

Re: [DISCUSS] Autoformat python code with Black

2020-01-21 Thread Chad Dombrova
+1 to autoformatting

Let me add some nuance to that.

The way I see it there are 2 varieties of formatters:  those which take the
original formatting into consideration (autopep8) and those which disregard
it (yapf, black).

I much prefer yapf to black, because you have plenty of options to tweak
with yapf (enough to make the output a pretty close match to the current
Beam style), and you can mark areas to preserve the original formatting,
which could be very useful with Pipeline building with pipe operators.
Please don't pick black.
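
For example, a hand-formatted pipeline can be protected like this (a
contrived snippet, not taken from the Beam codebase):

# yapf: disable
counts = (
    lines
    | 'Split' >> beam.FlatMap(lambda line: line.split())
    | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
    | 'GroupAndSum' >> beam.CombinePerKey(sum))
# yapf: enable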

autopep8 is more along the lines of spotless in Java -- it only corrects
code that breaks the project's style rules.  The big problem with Beam's
current style is that it is so esoteric that autopep8 can't enforce it --
and I'm not just talking about 2-spaces, which I don't really have a
problem with -- the problem is the use of either 2 or 4 spaces depending on
context (expression start vs hanging indent, etc).  This is my *biggest*
gripe about the current style.  PyCharm doesn't have enough control
either.  So, if we can choose a style that can be expressed by flake8 or
pycodestyle then we can use autopep8 to enforce it.
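
To make that concrete: once the style is expressible as pycodestyle
codes, enforcement is a single command, e.g. (the selected codes below
only illustrate the mechanics, they are not a proposal):

autopep8 --in-place --recursive --max-line-length=79 \
    --select=E1,E2,E3,W2,W3 sdks/python/apache_beam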

I'd prefer autopep8 to yapf because I like having a little wiggle room to
influence the style, but on a big project like Beam all that wiggle room
ends up to minor but noticeable inconsistencies in style throughout the
project.  yapf ensures completely consistent style, but the tradeoff is
that it's sometimes ugly, especially in scenarios with similar repeated
entries like argparse, where yapf might insert line breaks in visually
inconsistent and unappealing ways depending on the lengths of the keywords
and expressions involved.

Either way (but especially if we choose yapf) I think it'd be a nice
addition to setup a pre-commit [1] config so that people can opt in to
running *lightweight* autofixers prior to commit.  This will not only
reduce dev frustration but will also reduce the amount of cpu cycles that
Jenkins spends pointing out lint errors.

[1] https://pre-commit.com/
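
As a sketch of what opting in would look like (the repo URL and pinned
rev here are illustrative):

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/mirrors-yapf
    rev: v0.29.0
    hooks:
      - id: yapf
        files: ^sdks/python/
# each developer opts in with:
#   pip install pre-commit && pre-commit install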

-chad




On Tue, Jan 21, 2020 at 12:52 PM Ismaël Mejía  wrote:

> Last time we discussed this there seems not to be much progress into
> autoformatting.
> This tool looks more tweakable, so maybe it could be more appropriate for
> Beam's use case.
> https://github.com/google/yapf/
> WDYT?
>
>
> On Thu, May 30, 2019 at 10:50 AM Łukasz Gajowy  wrote:
>
>> +1 for any autoformatter for Python SDK that does the job. My experience
>> is that since spotless in Java SDK I would never start a new Java project
>> without it. So many great benefits not only for one person coding but for
>> all community.
>>
>> It is a GitHub UI issue that you cannot easily browse past the reformat.
>> It is not actually that hard, but does take a couple extra clicks to get
>> GitHub to display blame before a reformat. It is easier with the command
>> line. I do a lot of code history digging and the global Java reformat is
>> not really a problem.
>>
>> It's actually one more click on Github but I agree it's not the best way
>> to search the history. The most convenient and clear one I've found so far
>> is in Jetbrains IDEs (Intelij) where you can:
>>
>> right click on line number -> "annotate" -> click again -> "annotate
>> previous revision" -> ...
>>
>> You can also use "compare with" to see the diff between two revisions.
>>
>> Łukasz
>>
>>
>>
>>
>>
>> czw., 30 maj 2019 o 06:15 Kenneth Knowles  napisał(a):
>>
>>> +1 pending good enough tooling (I can't quite tell - seems there are
>>> some issues?)
>>>
>>> On Wed, May 29, 2019 at 2:40 PM Katarzyna Kucharczyk <
>>> ka.kucharc...@gmail.com> wrote:
>>>
 What else actually we gain? My guess is faster PR review iteration. We
 will skip some of conversations about code style.

>>> ...
>>>
 Last but not least, new contributor may be less discouraged. When I
 started contribute I didn’t know how to format my code and I lost a lot of
 time to add pylint and adjust IntelliJ. I eventually failed. Currently I
 write code intuitively and when I don’t forget I rerun tox.

>>>
>>> This is a huge benefit. This is why I supported it so much for Java. It
>>> is a community benefit. You do not have to be a contributor to the Python
>>> SDK to support this. That is why I am writing here. Just eliminate all
>>> discussion of formatting. It doesn't really matter what the resulting
>>> format is, if it is not crazy to read. I strongly oppose maintaining a
>>> non-default format.
>>>
>>> Reformating 20k lines or 200k is not hard. The Java global reformat
>>> touched 50k lines. It does not really matter how big it is. Definitely do
>>> it all at once if you think the tool is good enough. And you should pin a
>>> version, so churn is not a problem. You can upgrade the version and
>>> reformat in a PR later and that is also easy.
>>>
>>> It is a GitHub UI issue that you cannot easily browse past the reformat.
>>> It is not actually that hard, but 

Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Chad Dombrova
>
> > I agree with focusing one mypy for now, but I would propose soon after,
> or in parallel if it will be different folks, to work on pytype and enable
> it as a first class citizen similar to mypy. If there will be a large delta
> between the two then we can decide on what to do next.
>
> If there is a large delta, I wonder if what is needed to provide
> sufficient typing on the public API of beam (needed for users) could
> be a much smaller subset than that required fully
> documenting/typechecking the internals (which is also likely to be
> where more of the tricky bits are).
>

Let's separate our two use cases for type checking:

1) developers working on the Beam source
2) developers using Beam

For #1, there's little to gain from running a second type checker, and as
discussed, it's a burden.  so mypy is the clear winner here.

For #2, there's actually no need to expose all of the internals of Beam to
pytype, and as Robert points out, this is where the "tricky bits" reside.
We could use mpy's stubgen tool to create .pyi files of just the API
interface -- no internals -- as a separate apache_beam_stubs package.  Then
pytype users (or users of other type checkers) could optionally install
this and point pytype at it.  This would be a bit more automatic for end
users if pytype supported pep561[1], which is all about the packaging and
discovery of typed packages, including stub-only packages like I'm
proposing.

[1] https://github.com/google/pytype/issues/151
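
As a rough sketch of the idea (the module list and package name are
hypothetical):

# generate signature-only .pyi stubs for the public-facing modules
stubgen -o stubs/ sdks/python/apache_beam/pipeline.py \
    sdks/python/apache_beam/pvalue.py \
    sdks/python/apache_beam/transforms/core.py
# ship stubs/ as a separate "apache-beam-stubs" package that pytype users
# can install and point their checker at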

-chad


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Chad Dombrova
>
> Pytype seems to detect attribute errors that mypy has not, so it acts as a
> kind-of linter in this case.
> Examples:
>
> https://github.com/apache/beam/pull/10528/files#diff-0cb34b4622b0b7d7256d28b1ee1d52fc
>
> https://github.com/apache/beam/pull/10528/files#diff-7e4ad8c086414399957cdbea711ebd36
>
> https://github.com/apache/beam/pull/10528/files#diff-d5c3f4f603204c5c5917d89e90dba53d
> (it also makes pytype more strict in a sense)
>

Note that mypy is still not fully passing on master so it's unclear from
those diffs exactly how the two tools differ.  Many of the fixes you've
made for pytype look familiar to me from mypy, but my fixes may not be
merged yet.  For example, mypy also does not support @total_ordering, but
my fix for that is still pending.


Re: [DISCUSS] Python static type checkers

2020-01-12 Thread Chad Dombrova
Hi folks,
I agree with Robert that we need to wait and see before making any
decisions, but I do have some opinions about the probable/desired outcome.

I haven't used pytype, but my experience working with mypy over the past
few years -- and following various issues and peps related to it and typing
in general -- has taught me there's still a lot of room for interpretation
and thus variation between type checkers.

Here's a simple example: ignoring errors.  Both tools support ignoring
errors using a `type: ignore` comment, but only mypy (to my knowledge)
supports specifying an error type so that only that error is suppressed,
e.g. `type: ignore[error-code-here]`.   There's even room for differences
with regard to the line number where the error is emitted and thus where
the ignore comment must be placed (end of statement, site of open paren,
site of close paren, etc).  I know this because mypy has actually made
adjustments to this once or twice over the years, which necessitated moving
existing ignore comments.  So just imagine having to ignore the same error
separately for each type checker.  It's not the end of the world, but it's
ugly and frustrating.
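
A contrived example of the difference:

from typing import List

def lengths(xs):
    # type: (List[str]) -> List[int]
    return [len(x) for x in xs]

# mypy can suppress just the error you intend to suppress:
lengths([1, 2, 3])  # type: ignore[arg-type]

# a bare ignore -- the only form both checkers understand -- silences
# every error on the line, including ones introduced later:
lengths([1, 2, 3])  # type: ignore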

As a user, it can be quite challenging to solve certain typing issues, and
there's a fairly steep learning curve -- I wouldn't want to burden users
with *two* type checkers, each with its own idiosyncrasies.  That said, a
linter that doesn't actually prevent merges when an error occurs will be
ignored by users and quickly become less-than-useful.  Post-commit would
not be a good idea for all the reasons that a post-commit lint check would
be annoying (users will trip it often and feel surprised/blind-sided).

In the little exposure that I've had with pytype it seems to lag behind
mypy in terms of features, especially wrt typing-related peps (it never
fully supported pep484 multi-line type comments and it still doesn't
support pep561, I see no mention of pep589/TypedDict in the docs, but then
again they are *incredibly* light).  I've gotten mypy completely passing,
and I know it very well, so I'm pretty biased towards making it the one and
only type checker that generates pre-commit errors.  I see little advantage
to most end users in supporting pytype, except y'know, Google has kind of
an important presence in Apache Beam project  :)

Some quick pypi download figures to back that up:

Downloads last month:
pytype: 24,864
mypy: 1,502,582

So to sum up this email in a sentence: running mypy in pre-commit checks
would be beneficial, but making pytype also pass would be a burden with
little benefit for the majority of users.

But as I said at the outset, before we make any decisions we should get at
least one type checker in place and start getting feedback, because we're
still in the land of conjecture.

Hopefully I'll have all my typing changes merged in the next month or two,
at which point we can discuss enabling it as part of the pre-commit lint
job.

-chad




On Tue, Jan 7, 2020 at 7:02 PM Udi Meiri  wrote:

> Hi,
> We recently added mypy to the Jenkins Lint job for PRs (currently ignores
> errors). Mypy is a static type checker.
>
> There's a JIRA for adding another static type checker named pytype
> https://issues.apache.org/jira/browse/BEAM-9064
>
> I wanted to ask the community their thoughts on this. (see JIRA issue
> comments as well)
>
> - Should PRs have to pass more than 1 static type checker? (in pre-commit
> tests)
> - If not, should the remaining type checkers be run as a post-commit tests?
> - How much effort should be put into supporting more than 1 type checker?
> (i.e. making sure that they all pass)
>
>
>
>


Re: Root logger configuration

2019-12-16 Thread Chad Dombrova
On Mon, Dec 16, 2019 at 5:59 PM Pablo Estrada  wrote:

> +chad...@gmail.com  is this consistent with behavior
> that you observed?
>

I honestly can't recall, sorry.  I just remember that while I was testing I
updated the sdk version and some logging stopped.  I *think* I was missing the
state/message stream, which would be on the client side after pipeline
construction.

-chad


Re: RFC: python static typing PR

2019-12-12 Thread Chad Dombrova
Thanks for the kind words, everyone.

Note that the PR that was merged was the first half, so the mypy-lint job
is not yet setup to trigger failures.

Part 2 is up now:  https://github.com/apache/beam/pull/10367

My goal is to bundle up changes into smaller PRs from here on out.  It
might take another 3 PRs to get through the rest.

-chad


On Wed, Dec 11, 2019 at 2:13 PM Ahmet Altay  wrote:

> Thank you, Chad! This is awesome.
>
> On Wed, Dec 11, 2019 at 2:05 PM Robert Bradshaw 
> wrote:
>
>> +1. Thanks!
>>
>> On Wed, Dec 11, 2019 at 1:44 PM Pablo Estrada  wrote:
>> >
>> > Love it. I've merged it, since it was already approved by Robert - and
>> yes, we don't want to hit a merge conflict.
>> >
>> > On Wed, Dec 11, 2019 at 1:36 PM Heejong Lee  wrote:
>> >>
>> >> Wow, this is a big step forward. As a long fan of strongly typed
>> functional languages, I'm glad to see this change :)
>> >>
>> >> On Wed, Dec 11, 2019 at 9:44 AM Chad Dombrova 
>> wrote:
>> >>>
>> >>> Hi all,
>> >>> Robert has diligently reviewed the first batch of changes for this
>> PR, and all review notes are addressed and tests are passing:
>> https://github.com/apache/beam/pull/9915
>> >>>
>> >>> Due to the number of file touched there's a short window of about one
>> or two days before a merge conflict arrives on master, and after resolving
>> that it usually takes another 1-2 days of pasting "Run Python PreCommit"
>> until they pass again, so it would be great to get this merged while the
>> window is open!  Despite the number of files touched, the changes are
>> almost entirely type comments, so the PR is designed to be quite safe.
>> >>>
>> >>> -chad
>> >>>
>> >>>
>> >>> On Tue, Nov 5, 2019 at 2:50 PM Chad Dombrova 
>> wrote:
>> >>>>
>> >>>> Glad to hear we have such a forward-thinking community!
>> >>>>
>> >>>>
>> >>>> On Tue, Nov 5, 2019 at 2:43 PM Robert Bradshaw 
>> wrote:
>> >>>>>
>> >>>>> Sounds like we have consensus. Let's move forward. I'll follow up
>> with
>> >>>>> the discussions on the PRs themselves.
>> >>>>>
>> >>>>> On Wed, Oct 30, 2019 at 2:38 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >>>>> >
>> >>>>> > On Wed, Oct 30, 2019 at 1:26 PM Chad Dombrova 
>> wrote:
>> >>>>> > >
>> >>>>> > >> Do you believe that a future mypy plugin could replace
>> pipeline type checks in Beam, or are there limits to what it can do?
>> >>>>> > >
>> >>>>> > > mypy will get us quite far on its own once we completely
>> annotate the beam code.  That said, my PR does not include my efforts to
>> turn PTransforms into Generics, which will be required to properly analyze
>> pipelines, so there's still a lot more work to do.  I've experimented with
>> a mypy plugin to smooth over some of the rough spots in that workflow and I
>> will just say that the mypy API has a very steep learning curve.
>> >>>>> > >
>> >>>>> > > Another thing to note: mypy is very explicit about function
>> annotations.  It does not do the "implicit" inference that Beam does, such
>> as automatically detecting function return types.  I think it should be
>> possible to do a lot of that as a mypy plugin, and in fact, since it has
>> little to do with Beam it could grow into its own project with outside
>> contributors.
>> >>>>> >
>> >>>>> > Yeah, I don't think, as is, it can replace what we do, but with
>> >>>>> > plugins I think it could possibly come closer. Certainly there is
>> >>>>> > information that is only available at runtime (e.g. reading from a
>> >>>>> > database or avro/parquet file could provide the schema which can
>> be
>> >>>>> > used for downstream checking) which may limit the ability to do
>> >>>>> > everything statically (even Beam Java is moving this direction).
>> Mypy
>> >>>>> > clearly has an implementation of the "is compatible with" operator
>> >>>>> > that I would love to borrow, but unfortunately it's not (easily?)
>> >>>>> > exposed.
>> >>>>> >
>> >>>>> > That being said, we should leverage what we can for pipeline
>> >>>>> > authoring, and it'll be a great development too regardless.
>>
>


Re: Cython unit test suites running without Cythonized sources

2019-12-11 Thread Chad Dombrova
>
> IIUC, isolated_build=True and the removal of setup.py invocation in the
> current virtualenv should eliminate any Cython output files in the repo,
> and no need for run_tox_cleanup.sh?
>

Correct, that script is deleted in this commit:
https://github.com/apache/beam/pull/10038/commits/c6dab09abf9f4091f0dbf7eac2964c5be0665763


Re: RFC: python static typing PR

2019-12-11 Thread Chad Dombrova
Hi all,
Robert has diligently reviewed the first batch of changes for this PR, and
all review notes are addressed and tests are passing:
https://github.com/apache/beam/pull/9915

Due to the number of file touched there's a short window of about one or
two days before a merge conflict arrives on master, and after resolving
that it usually takes another 1-2 days of pasting "Run Python PreCommit"
until they pass again, so it would be great to get this merged while the
window is open!  Despite the number of files touched, the changes are
almost entirely type comments, so the PR is designed to be quite safe.

-chad


On Tue, Nov 5, 2019 at 2:50 PM Chad Dombrova  wrote:

> Glad to hear we have such a forward-thinking community!
>
>
> On Tue, Nov 5, 2019 at 2:43 PM Robert Bradshaw 
> wrote:
>
>> Sounds like we have consensus. Let's move forward. I'll follow up with
>> the discussions on the PRs themselves.
>>
>> On Wed, Oct 30, 2019 at 2:38 PM Robert Bradshaw 
>> wrote:
>> >
>> > On Wed, Oct 30, 2019 at 1:26 PM Chad Dombrova 
>> wrote:
>> > >
>> > >> Do you believe that a future mypy plugin could replace pipeline type
>> checks in Beam, or are there limits to what it can do?
>> > >
>> > > mypy will get us quite far on its own once we completely annotate the
>> beam code.  That said, my PR does not include my efforts to turn
>> PTransforms into Generics, which will be required to properly analyze
>> pipelines, so there's still a lot more work to do.  I've experimented with
>> a mypy plugin to smooth over some of the rough spots in that workflow and I
>> will just say that the mypy API has a very steep learning curve.
>> > >
>> > > Another thing to note: mypy is very explicit about function
>> annotations.  It does not do the "implicit" inference that Beam does, such
>> as automatically detecting function return types.  I think it should be
>> possible to do a lot of that as a mypy plugin, and in fact, since it has
>> little to do with Beam it could grow into its own project with outside
>> contributors.
>> >
>> > Yeah, I don't think, as is, it can replace what we do, but with
>> > plugins I think it could possibly come closer. Certainly there is
>> > information that is only available at runtime (e.g. reading from a
>> > database or avro/parquet file could provide the schema which can be
>> > used for downstream checking) which may limit the ability to do
>> > everything statically (even Beam Java is moving this direction). Mypy
>> > clearly has an implementation of the "is compatible with" operator
>> > that I would love to borrow, but unfortunately it's not (easily?)
>> > exposed.
>> >
>> > That being said, we should leverage what we can for pipeline
>> > authoring, and it'll be a great development too regardless.
>>
>


Re: Cython unit test suites running without Cythonized sources

2019-12-11 Thread Chad Dombrova
Hi Udi,

> Sorry I didn't realize you already had a solution for the shadowing issue
> and BEAM-8572.
>

No worries at all.  I haven't had much time to invest into that PR lately
(most of it I did at home on my own time), but I did get past most of the
major issues.  You've been working on so many of the same problems I was
trying to solve there, and so far you've been coming to the same
conclusions independently (e.g. removing pytest-runner and
setup_requires).   It's great to have that validation, and it's helped
reduce the scope of my PR.  Moving forward, I would love to team up on
this.  Happy to answer any questions you have about the approach I took.

-chad


Re: Cython unit test suites running without Cythonized sources

2019-12-10 Thread Chad Dombrova
Hi Udi, I know you're aware of my PR
<https://github.com/apache/beam/pull/10038>, but I really encourage you to
look into pep517 and pep518.  They are the new solution for all of this --
declaring build dependencies and creating isolated out-of-source builds
e.g. using tox.  Another thing I added to my PR is something which asserts
that cython is *or is not* enabled based on an env var set in the tox
config.  In other words, we're currently skipping tests when cython is not
installed or when code is not cythonized, but we're not asserting whether
we *expect* that code to be cythonized or not.  Also there are a few bugs
where cython tests don't run at all because we're identifying the wrong
module name to check for cythonization ("apache_beam.coders" which is a
package).
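
Conceptually the assertion looks something like this (the env var name
and the probe module are illustrative, not the exact code in the PR):

import os
import unittest

class CompilationStateTest(unittest.TestCase):

  def test_cythonization_matches_expectation(self):
    expect_compiled = os.environ.get('EXPECT_CYTHON') == '1'
    try:
      # stream.pyx only produces an importable module when cythonized
      import apache_beam.coders.stream  # pylint: disable=unused-import
      is_compiled = True
    except ImportError:
      is_compiled = False
    self.assertEqual(expect_compiled, is_compiled)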

-chad


On Tue, Dec 10, 2019 at 6:03 PM Udi Meiri  wrote:

> To follow up, since I'm trying to run cython-based tests using pytest:
> - tox does in fact correctly install apache-beam with cythonized modules
> in its virtualenv.
> - Since our tests are under apache_beam/, local sources shadow those in
> the installed apache_beam package.
> - The original issue I raised (BEAM-8572
> <https://issues.apache.org/jira/browse/BEAM-8572>) was due to bad usage
> of utils.check_compiled().
>
> So I'm going to add an additional "python setup.py build_ext --inplace" to
> pyXX-cython-pytest envs.
> If we want to remove this additional step we'll probably have to move
> tests under a separate directory that is not part of the apache_beam
> package.
>
> See also:
> https://github.com/tox-dev/tox/issues/514
> https://blog.ionelmc.ro/2014/05/25/python-packaging/#the-structure
>
> On Mon, Nov 11, 2019 at 3:21 PM Ahmet Altay  wrote:
>
>> Thank you for spending time on this to clarify it for all of us! Much
>> appreciated.
>>
>> On Sun, Nov 10, 2019 at 3:45 PM Chad Dombrova  wrote:
>>
>>> Hi all,
>>>
>>>
>>>> The sdist step creates a package that should be installed into each
>>>> tox environment. If the tox environment has cython when this apache
>>>> beam package is installed, it should be used. Nose (or whatever)
>>>> should then run the tests.
>>>>
>>> I spent some time this weekend trying to understand the Beam python
>>> build process, and here’s an overview of what I’ve learned:
>>>
>>>- the :sdks:python:sdist gradle task creates the source tarball (no
>>>surprises there)
>>>   - the protobuf stubs are generated during this process
>>>- the sdist is provided to tox, which installs it into the the
>>>virtualenv for that task
>>>- for *-cython tasks, tox installs the cython dep and, as Ahmet
>>>asserted, python setup.py nosetests performs the cythonization.
>>>   - this cythonized build overrides the one installed by tox
>>>
>>> Here’s what I learned about the current status of tests wrt cython:
>>>
>>>- cython tox tasks *are* using cython (good!)
>>>- non-cython tox tasks *are not* using cython (good!)
>>>- none of the GCP or integration tests are using cython (bad?)
>>>
>>> This is intentional with the recent change to drop base only tests.
>> Otherwise we would not have coverage for non-cythonized tests.
>>
>>>
>>>- This is because the build is only cythonized when python setup.py
>>>   nosetests is used in conjunction with tox (tox installs cython, python
>>>   setup.py nosetests compiles it).
>>>   - GCP tests don't install cython.  ITs don't use tox.
>>>
>>> To confirm my understanding of this, I created a PR [1] to assert that a
>>> cythonized or pure-python build is being used.  A cythonized build is
>>> expected by default on linux systems unless a special flag is provided to
>>> inform the test otherwise.  It appears as though the portable tests passed
>>> (i.e. used cython), but I forgot to add the assertion for those; like the
>>> other ITs they are not using cython.
>>>
>>> *Questions:*
>>>
>>>- Is the lack of cython for ITs expected and/or desired?
>>>- Why aren't ITs using tox?  It's quite possible to pass arguments
>>>>into tox to control its behavior.  For example, it seems reasonable that
>>>run_integration_test.sh could be inside tox
>>>
>>>
>> For ITs the primary execution will happen on a remote system. As long as
>> those remote workers have cython installed the tests will run in a
>> cythonized environment. This is true for any IT tests, that runs in a
>> container based on the Dockerfile we have in beam (which installs cython),
>> and also true for legacy Dataflow environments that are not yet using this
>> Beam provided Dockerfile.
>>
>>
>>>
>>> *Next Steps:* There has been some movement in the python community to
>>> solve problems around build dependencies [2] and toolchains [3].  I hope to
>>> have a proposal for how to simplify this process soon.
>>>
>> Thank you!
>>
>> [1] https://github.com/apache/beam/pull/10058
>>> [2] https://www.python.org/dev/peps/pep-0517/
>>> [3] https://www.python.org/dev/peps/pep-0518/
>>>
>>> -chad
>>>
>>


Re: Python: pytest migration update

2019-12-09 Thread Chad Dombrova
After this PR goes in should we revisit breaking up the python tests into
separate jenkins jobs by python version?  One of the problems with that
plan originally was that we lost the parallelism that gradle provides
because we were left with only one tox task per jenkins job, and so the
total time to complete all python jenkins jobs went up a lot.  With
pytest + xdist we should hopefully be able to keep the parallelism even
with just one tox task.  This could be a big win.  I feel like I'm spending
more time monitoring and re-queuing timed-out jenkins jobs lately than I am
writing code.
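
Something like the following per-version env would let a single jenkins
job still use all of a machine's cores (a sketch; the real env would also
need our extras and pytest flags):

[testenv:py37-pytest]
deps =
  pytest
  pytest-xdist
commands =
  pytest -n auto --pyargs apache_beam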

On Mon, Dec 9, 2019 at 10:32 AM Udi Meiri  wrote:

> This PR  (in review) migrates
> py27-gcp to using pytest.
> It reduces the testPy2Gcp task down to ~13m
> 
> (from ~45m). This speedup will probably be lower once all 8 tasks are using
> pytest.
> It also adds 5 previously uncollected tests.
>


Re: Python staging file weirdness

2019-12-05 Thread Chad Dombrova
On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev 
wrote:

Ah nice, so then the workflow would be: download [missing] deps from pypi
> into a long-lived cache directory, then download/copy the same deps into
> a short-lived temporary directory, using  long-lived cache directory as
> SoT, then stage files from a short-lived temporary directory and clean it
> up. Is that what you are suggesting, Chad?
>
Yes, I just did a quick test to confirm:

# download or build wheels of anything that's missing from the cache
# note: we're including gcp extras:
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
# copy some of those wheels somewhere else
# note: we're excluding gcp extras
pip download apache_beam==2.16 --no-binary \
    --find-links=/tmp/wheel-cache --dest /tmp/wheel-dest/
# rerun to confirm that cached wheels are being re-used instead of
# downloaded from pypi
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache

/tmp/wheel-dest/ will now have a subset of the deps from /tmp/wheel-cache,
excluding the gcp extras.

Note that for some reason the equal sign after --find-links is required, at
least for me on pip 19.1.1. Using a space resulted in an error.

-chad


Re: Python staging file weirdness

2019-12-05 Thread Chad Dombrova
Another way to copy only the deps you care about is to use `pip download`
to do the copy.  I believe you can provide the cache dir to `pip download
--find-links` and it will read from that before reading from pypi (you may
also need to set --wheel-dir to the cache dir as well), and thus it acts as
a simple copy.

-chad


On Thu, Dec 5, 2019 at 12:07 PM Valentyn Tymofieiev 
wrote:

> Looked for a bit at pip download command. The alternative seems to parse
> the output of
>
> python -m pip download  --dest . -r requirements.txt  --exists-action i
> --no-binary :all:
>
> and see which files were downloaded and/or skipped since they were already
> present, and then stage only the files that appear in the log output. Seems
> doable but may break if pip output changes between pip implementations, so
> we'd have to add a test as well.
>
> On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik  wrote:
>
>> I think reusing the same cache directory makes sense during downloading
>> but why do we upload everything that is there?
>>
>> On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri  wrote:
>>
>>> Looking at the source, it seems that it should be using a
>>> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
>>> to create a different tmp directory on each run.
>>>
>>> Also, sampling worker no. 2:
>>>
>>> *jenkins@apache-beam-jenkins-2*:*~*$ ls -l /tmp/dataflow-requirements-cache/
>>> total 7172
>>> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 *funcsigs-1.0.2.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 *mock-3.0.5.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 *PyHamcrest-1.9.0.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 *setuptools-41.2.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 *setuptools-41.4.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 *setuptools-41.5.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 *setuptools-41.5.1.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 *setuptools-41.6.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 *setuptools-42.0.0.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 *setuptools-42.0.1.zip*
>>> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 *setuptools-42.0.2.zip*
>>> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 *six-1.12.0.tar.gz*
>>> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 *six-1.13.0.tar.gz*
>>>
>>>
>>> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik  wrote:
>>>
 Can we filter the cache directory only for the artifacts that we want
 and not everything that is there?

 On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev 
 wrote:

> Luke, I am not sure I understand the question. The caching that
> happens here is implemented in the SDK for requirements packages:
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>
>
> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik  wrote:
>
>> Is there a way to use a cache on disk that is separate from the set
>> of packages we use as requirements?
>>
>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri  wrote:
>>
>>> Thanks!
>>> Another reason to periodically referesh workers.
>>>
>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Tests job specify[1] a requirements.txt file that contains two
 entries: pyhamcrest, mock.

 We download[2]  sources of packages specified in requirements file,
 and packages they depend on. While doing so, it appears that we use a 
 cache
 directory on jenkins to store the sources of the packages [3], perhaps 
 to
 save a trip to pypi and reduce pypi flakiness? Then, we stage the 
 entire
 cache directory[4], which includes all packages ever cached. Overtime 
 the
 versions that our requirements packages need change, but I guess we 
 don't
 clean the cache on Jenkins workers.

 [1]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
 [2]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
 [3]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161

 [4]
 https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172

 On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri 
 wrote:

> I was investigating a Dataflow postcommit test failure
> (endpoints_pb2 missing), and saw this in the staging directory:
>
> $ gsutil ls 
> 

Re: cython test instability

2019-11-26 Thread Chad Dombrova
yeah, I've excised both tests_require and setup_requires in my test
simplification PR:  https://github.com/apache/beam/pull/10038

I'm happy to see those go sooner rather than later, as it'll reduce the
scope of my PR.  The rest of my PR is about ensuring that build
dependencies like cython and grpc are available at "build" time (i.e. when
setup.py gets called), and the modern solution for this is a
pep517/518-compliant build system, of which tox is one.
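
For reference, the heart of it is declaring the build requirements in
pyproject.toml so that pip/tox build the sdist in an isolated environment
where those deps are guaranteed to be importable (the versions below are
illustrative):

# pyproject.toml
[build-system]
requires = [
    "setuptools",
    "wheel",
    "cython>=0.28.1",
    "grpcio-tools>=1.3.5",
]
build-backend = "setuptools.build_meta"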

-chad



On Tue, Nov 26, 2019 at 6:39 PM Udi Meiri  wrote:

> I'm not sure about where the error with the simplegeneric, timeloop .eggs
> directories come from,
> but I did figure out that they don't get installed as eggs if you add them
> to the "test" extras in setup.py, e.g.:
>
> extras_require={
> 'docs': ['Sphinx>=1.5.2,<2.0'],
> 'test': REQUIRED_TEST_PACKAGES + INTERACTIVE_BEAM,
> 'gcp': GCP_REQUIREMENTS,
> 'interactive': INTERACTIVE_BEAM,
> },
>
> This is further proof of the wisdom of the pytest-runner deprecation
> notice <https://pypi.org/project/pytest-runner/> (emphasis mine):
> """
> Remove ‘pytest’ and any other testing requirements from ‘*tests_require*’,
> preferably removing the setup_requires option.
> """
>
> I believe we don't rely on the tests_require definition. Removing it might
> break developers running "python setup.py test", but the alternative is a
> simple "setup.py && pip install".
>
>
> On Tue, Nov 26, 2019 at 5:14 PM Chad Dombrova  wrote:
>
>> Sorry wrong link:  https://github.com/apache/beam/pull/9915
>>
>>
>>
>> On Tue, Nov 26, 2019 at 5:12 PM Udi Meiri  wrote:
>>
>>> I looked at #9959 but it doesn't seem to modify setup.py?
>>> The additional eggs for timeloop etc. are troubling though. Not sure
>>> where those come from.
>>>
>>> On Tue, Nov 26, 2019 at 4:59 PM Chad Dombrova  wrote:
>>>
>>>> Is setup_requires being used somewhere else, because I'm still getting
>>>> errors after removing it from sdks/python/setup.py.
>>>>
>>>> I removed it from this PR: https://github.com/apache/beam/pull/9959
>>>>
>>>> Here's the gradle scan:
>>>> https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0
>>>>
>>>> The error shows up differently than before when
>>>> setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
>>>> now rather than the console log.  I've also seen different packages listed
>>>> as the culprit (simplegeneric, timeloop).
>>>>
>>>> -chad
>>>>
>>>>
>>>>
>>>> On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:
>>>>
>>>>> Chad, I believe the answer is the "setup_requires" line is causing the
>>>>> sdks/python/.eggs directory to be created.
>>>>>
>>>>> This command fails with the setup_requires line (same Errno 17), but
>>>>> succeeds without it:
>>>>> $ \rm -r .eggs/; ../../gradlew installGcpTest
>>>>> [~8 failed tasks]
>>>>> $ ls .eggs
>>>>> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>>>>>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>>>>>
>>>>> I'll go ahead and create a PR to remove setup_requires from setup.py.
>>>>>
>>>>> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova 
>>>>> wrote:
>>>>>
>>>>>> It seems like the offending packages are those that only have source
>>>>>> distributions (i.e. no wheels).  But why are the eggs being installed in
>>>>>> sdks/python/.eggs instead of into the virtualenv created by 
>>>>>> setupVirtualenv
>>>>>> gradle task or by tox?
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>>>>>>
>>>>>>> Basically, I believe what's happening is that a new Gradle task was
>>>>>>> added that uses setup.py but doesn't have the same dependency on some 
>>>>>>> main
>>>>>>> setup.py task that all others depend on (like sdist).
>>>>>>>
>>>>>>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>>>>>>
>>>>>>>> Correction: the error is not gone after removing the line. I get
>>>>>>>> instead:
>>>>>>>> error: [Errno 1

Re: cython test instability

2019-11-26 Thread Chad Dombrova
Sorry wrong link:  https://github.com/apache/beam/pull/9915



On Tue, Nov 26, 2019 at 5:12 PM Udi Meiri  wrote:

> I looked at #9959 but it doesn't seem to modify setup.py?
> The additional eggs for timeloop etc. are troubling though. Not sure where
> those come from.
>
> On Tue, Nov 26, 2019 at 4:59 PM Chad Dombrova  wrote:
>
>> Is setup_requires being used somewhere else? I'm still getting
>> errors after removing it from sdks/python/setup.py.
>>
>> I removed it from this PR: https://github.com/apache/beam/pull/9959
>>
>> Here's the gradle scan:
>> https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0
>>
>> The error shows up differently than before when
>> setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
>> now rather than the console log.  I've also seen different packages listed
>> as the culprit (simplegeneric, timeloop).
>>
>> -chad
>>
>>
>>
>> On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:
>>
>>> Chad, I believe the answer is the "setup_requires" line is causing the
>>> sdks/python/.eggs directory to be created.
>>>
>>> This command fails with the setup_requires line (same Errno 17), but
>>> succeeds without it:
>>> $ \rm -r .eggs/; ../../gradlew installGcpTest
>>> [~8 failed tasks]
>>> $ ls .eggs
>>> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>>>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>>>
>>> I'll go ahead and create a PR to remove setup_requires from setup.py.
>>>
>>> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova  wrote:
>>>
>>>> It seems like the offending packages are those that only have source
>>>> distributions (i.e. no wheels).  But why are the eggs being installed in
>>>> sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
>>>> gradle task or by tox?
>>>>
>>>>
>>>> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>>>>
>>>>> Basically, I believe what's happening is that a new Gradle task was
>>>>> added that uses setup.py but doesn't have the same dependency on some main
>>>>> setup.py task that all others depend on (like sdist).
>>>>>
>>>>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>>>>
>>>>>> Correction: the error is not gone after removing the line. I get
>>>>>> instead:
>>>>>> error: [Errno 17] File exists:
>>>>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>>>>>
>>>>>>> I managed to recreate one of the issues with this command:
>>>>>>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>>>>>>> "python setup.py -q nosetests --tests
>>>>>>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; 
>>>>>>> done
>>>>>>>
>>>>>>> This reliably gives me:
>>>>>>> OSError: [Errno 17] File exists:
>>>>>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>>>>>>
>>>>>>> If I remove this line from setup.py the error is gone:
>>>>>>>   setup_requires=['pytest_runner'],
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for looking into this. It seems like it might be something
>>>>>>>> to do with data that is cached on the Jenkins slaves between runs, 
>>>>>>>> which
>>>>>>>> may be what prevents this from showing up locally?
>>>>>>>>
>>>>>>>> If your theory about setuptools is correct, and it sounds likely,
>>>>>>>> we should be able to lock down the version, which we should definitely 
>>>>>>>> be
>>>>>>>> doing for all of our dependencies.
>>>>>>>>
>>>>>>>> -chad
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 26, 2019 at 1:

Re: cython test instability

2019-11-26 Thread Chad Dombrova
Is setup_requires being used somewhere else? I'm still getting
errors after removing it from sdks/python/setup.py.

I removed it from this PR: https://github.com/apache/beam/pull/9959

Here's the gradle scan:
https://scans.gradle.com/s/oinh5xpaly3dk/failure#top=0

The error shows up differently than before when
setup_requires=['pytest-runner'] was present -- it's in a gradle traceback
now rather than the console log.  I've also seen different packages listed
as the culprit (simplegeneric, timeloop).

-chad



On Tue, Nov 26, 2019 at 4:47 PM Udi Meiri  wrote:

> Chad, I believe the answer is the "setup_requires" line is causing the
> sdks/python/.eggs directory to be created.
>
> This command fails with the setup_requires line (same Errno 17), but
> succeeds without it:
> $ \rm -r .eggs/; ../../gradlew installGcpTest
> [~8 failed tasks]
> $ ls .eggs
> pytest_runner-5.2-py2.7.egg  pytest_runner-5.2-py3.5.egg
>  pytest_runner-5.2-py3.6.egg  pytest_runner-5.2-py3.7.egg  README.txt
>
> I'll go ahead and create a PR to remove setup_requires from setup.py.
>
> On Tue, Nov 26, 2019 at 4:16 PM Chad Dombrova  wrote:
>
>> It seems like the offending packages are those that only have source
>> distributions (i.e. no wheels).  But why are the eggs being installed in
>> sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
>> gradle task or by tox?
>>
>>
>> On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:
>>
>>> Basically, I believe what's happening is that a new Gradle task was
>>> added that uses setup.py but doesn't have the same dependency on some main
>>> setup.py task that all others depend on (like sdist).
>>>
>>> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>>>
>>>> Correction: the error is not gone after removing the line. I get
>>>> instead:
>>>> error: [Errno 17] File exists:
>>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>>>
>>>>
>>>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>>>
>>>>> I managed to recreate one of the issues with this command:
>>>>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>>>>> "python setup.py -q nosetests --tests
>>>>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; 
>>>>> done
>>>>>
>>>>> This reliably gives me:
>>>>> OSError: [Errno 17] File exists:
>>>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>>>>
>>>>> If I remove this line from setup.py the error is gone:
>>>>>   setup_requires=['pytest_runner'],
>>>>>
>>>>>
>>>>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova 
>>>>> wrote:
>>>>>
>>>>>> Thanks for looking into this. It seems like it might be something to
>>>>>> do with data that is cached on the Jenkins slaves between runs, which may
>>>>>> be what prevents this from showing up locally?
>>>>>>
>>>>>> If your theory about setuptools is correct, and it sounds likely, we
>>>>>> should be able to lock down the version, which we should definitely be
>>>>>> doing for all of our dependencies.
>>>>>>
>>>>>> -chad
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>>>>>>
>>>>>>> I tried to debug but did not make much progress. I cannot reproduce
>>>>>>> locally, however all python precommits and postcommits are failing.
>>>>>>>
>>>>>>> One guess is, setuptools released a new version that does not
>>>>>>> support eggs a few days ago, that might be the cause (
>>>>>>> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but
>>>>>>> that should have reproduced locally.
>>>>>>> Maybe something is wrong with the jenkins machines, and we could
>>>>>>> perhaps bring them to a clean state.
>>>>>>>
>>>>>>> I suspected this being related to pytest somehow (as the first 4
>>>>>>> JIRAs had pytest in the error line) but the error Chad saw is different.
>>>>>>>
>>>>>>> +Valentyn Tymofieiev  and +Yifan Zou
>>>>>

Re: cython test instability

2019-11-26 Thread Chad Dombrova
It seems like the offending packages are those that only have source
distributions (i.e. no wheels).  But why are the eggs being installed in
sdks/python/.eggs instead of into the virtualenv created by setupVirtualenv
gradle task or by tox?


On Tue, Nov 26, 2019 at 3:59 PM Udi Meiri  wrote:

> Basically, I believe what's happening is that a new Gradle task was added
> that uses setup.py but doesn't have the same dependency on some main
> setup.py task that all others depend on (like sdist).
>
> On Tue, Nov 26, 2019 at 3:49 PM Udi Meiri  wrote:
>
>> Correction: the error is not gone after removing the line. I get instead:
>> error: [Errno 17] File exists:
>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/dill-0.3.1.1-py2.7.egg'
>>
>>
>> On Tue, Nov 26, 2019 at 3:45 PM Udi Meiri  wrote:
>>
>>> I managed to recreate one of the issues with this command:
>>> ~/src/beam/sdks/python$ \rm -r .eggs/ && for i in $(seq 2); do echo
>>> "python setup.py -q nosetests --tests
>>> apache_beam.pipeline_test:DoFnTest.test_incomparable_default &" | sh ; done
>>>
>>> This reliably gives me:
>>> OSError: [Errno 17] File exists:
>>> '/usr/local/google/home/ehudm/src/beam/sdks/python/.eggs/pytest_runner-5.2-py2.7.egg'
>>>
>>> If I remove this line from setup.py the error is gone:
>>>   setup_requires=['pytest_runner'],
>>>
>>>
>>> On Tue, Nov 26, 2019 at 2:54 PM Chad Dombrova  wrote:
>>>
>>>> Thanks for looking into this. It seems like it might be something to do
>>>> with data that is cached on the Jenkins slaves between runs, which may be
>>>> what prevents this from showing up locally?
>>>>
>>>> If your theory about setuptools is correct, and it sounds likely, we
>>>> should be able to lock down the version, which we should definitely be
>>>> doing for all of our dependencies.
>>>>
>>>> -chad
>>>>
>>>>
>>>>
>>>> On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:
>>>>
>>>>> I tried to debug but did not make much progress. I cannot reproduce
>>>>> locally, however all python precommits and postcommits are failing.
>>>>>
>>>>> One guess is, setuptools released a new version that does not support
>>>>> eggs a few days ago, that might be the cause (
>>>>> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
>>>>> should have reproduced locally.
>>>>> Maybe something is wrong with the jenkins machines, and we could
>>>>> perhaps bring them to a clean state.
>>>>>
>>>>> I suspected this being related to pytest somehow (as the first 4 JIRAs
>>>>> had pytest in the error line) but the error Chad saw is different.
>>>>>
>>>>> +Valentyn Tymofieiev  and +Yifan Zou
>>>>>  could you help with looking into this?
>>>>>
>>>>>
>>>>> Ahmet
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>>>>>
>>>>>> I also started to see this on PRs that I'm reviewing.
>>>>>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py 
>>>>>> and
>>>>>> egg_info but this looks different than all of those, so I filed BEAM-8831.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova 
>>>>>> wrote:
>>>>>>
>>>>>>> Actually, it looks like I'm getting the same error on multiple PRs:
>>>>>>> https://scans.gradle.com/s/ihfmrxr7evslw
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> The cython tests started failing on one of my PRs which were
>>>>>>>> succeeding before.   The error is one that I've never seen before
>>>>>>>> (separated onto different lines to make it easier to read):
>>>>>>>>
>>>>>>>> Caused by: org.gradle.api.GradleException:
>>>>>>>> Could not copy file
>>>>>>>>
>>>>>>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>>>>>>> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
>>>>>>>> to
>>>>>>>>
>>>>>>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>>>>>>> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>>>>>>>>
>>>>>>>> Followed immediately by an error about could not create a directory
>>>>>>>> of the same name.  Here's the gradle scan:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> -chad
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>


Re: cython test instability

2019-11-26 Thread Chad Dombrova
Thanks for looking into this. It seems like it might be something to do
with data that is cached on the Jenkins slaves between runs, which may be
what prevents this from showing up locally?

If your theory about setuptools is correct, and it sounds likely, we should
be able to lock down the version, which we should definitely be doing for
all of our dependencies.

-chad



On Tue, Nov 26, 2019 at 1:33 PM Ahmet Altay  wrote:

> I tried to debug but did not make much progress. I cannot reproduce
> locally, however all python precommits and postcommits are failing.
>
> One guess is, setuptools released a new version that does not support eggs
> a few days ago, that might be the cause (
> https://github.com/pypa/setuptools/blob/master/CHANGES.rst) but that
> should have reproduced locally.
> Maybe something is wrong with the jenkins machines, and we could perhaps
> bring them to a clean state.
>
> I suspected this being related to pytest somehow (as the first 4 JIRAs had
> pytest in the error line) but the error Chad saw is different.
>
> +Valentyn Tymofieiev  and +Yifan Zou
>  could you help with looking into this?
>
>
> Ahmet
>
>
>
> On Tue, Nov 26, 2019 at 9:14 AM Luke Cwik  wrote:
>
>> I also started to see this on PRs that I'm reviewing.
>> BEAM-8793, BEAM-8653, BEAM-8631, BEAM-8249 mention issues with setup.py and
>> egg_info but this looks different than all of those, so I filed BEAM-8831.
>>
>>
>> On Mon, Nov 25, 2019 at 10:27 PM Chad Dombrova  wrote:
>>
>>> Actually, it looks like I'm getting the same error on multiple PRs:
>>> https://scans.gradle.com/s/ihfmrxr7evslw
>>>
>>>
>>>
>>>
>>> On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova 
>>> wrote:
>>>
>>>> Hi all,
>>>> The cython tests started failing on one of my PRs which were succeeding
>>>> before.   The error is one that I've never seen before (separated onto
>>>> different lines to make it easier to read):
>>>>
>>>> Caused by: org.gradle.api.GradleException:
>>>> Could not copy file
>>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>>> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
>>>> to
>>>> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
>>>> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>>>>
>>>> Followed immediately by an error about could not create a directory of
>>>> the same name.  Here's the gradle scan:
>>>>
>>>>
>>>> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>>>>
>>>> Any ideas?
>>>>
>>>> -chad
>>>>
>>>>
>>>>
>>>>
>>>>


Re: cython test instability

2019-11-25 Thread Chad Dombrova
Actually, it looks like I'm getting the same error on multiple PRs:
https://scans.gradle.com/s/ihfmrxr7evslw




On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova  wrote:

> Hi all,
> The cython tests started failing on one of my PRs which were succeeding
> before.   The error is one that I've never seen before (separated onto
> different lines to make it easier to read):
>
> Caused by: org.gradle.api.GradleException:
> Could not copy file
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
> to
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>
> Followed immediately by an error about could not create a directory of the
> same name.  Here's the gradle scan:
>
>
> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>
> Any ideas?
>
> -chad
>
>
>
>
>


cython test instability

2019-11-25 Thread Chad Dombrova
Hi all,
The cython tests started failing on one of my PRs which were succeeding
before.   The error is one that I've never seen before (separated onto
different lines to make it easier to read):

Caused by: org.gradle.api.GradleException:
Could not copy file
'/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
/src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
to
'/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.

Followed immediately by an error about could not create a directory of the
same name.  Here's the gradle scan:

https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0

Any ideas?

-chad


Re: [discuss] Using a logger hierarchy in Python

2019-11-19 Thread Chad Dombrova
Pablo, it might be necessary to set up a root logging handler if one does
not exist already.  I noticed that a LocalJobServicer that I was testing
against stopped emitting tracebacks when I rebased onto the latest from
master.  Setting up the root handler fixed it.  I'm still testing this, and
I might be misinterpreting what I saw, but I wanted to get eyes on it in
case I don't have time to get a definitive answer.
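
(For reference, "setting up the root handler" here just means something like the
following -- a minimal sketch using only the standard library, nothing
Beam-specific:)

import logging

# Install a root handler only if none exists yet, so that log records emitted
# by library code (including tracebacks) become visible on the console.
if not logging.getLogger().handlers:
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s: %(message)s')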

-chad



On Fri, Nov 15, 2019 at 4:30 PM Pablo Estrada  wrote:

> Thanks all,
> 2/3 of PRs are merged (using _LOGGER). It should be pretty easy to
> switch the variable name to _log via sed.
> Best
> -P.
>
> On Fri, Nov 15, 2019 at 2:08 PM Kyle Weaver  wrote:
>
>> +1 for per-module loggers (what Robert said).
>>
>> On Fri, Nov 15, 2019 at 1:48 PM Udi Meiri  wrote:
>>
>>> +1, but can we use something less verbose and shift-key-heavy than
>>> _LOGGER, like log or _log?
>>>
>>> Also please dedupe with these existing bugs:
>>> https://issues.apache.org/jira/browse/BEAM-3523
>>> https://issues.apache.org/jira/browse/BEAM-1825
>>>
>>> On Thu, Nov 14, 2019 at 8:02 AM Thomas Weise  wrote:
>>>
>>>> Awesome, thanks Chad!
>>>>
>>>> On Wed, Nov 13, 2019 at 10:26 PM Chad Dombrova 
>>>> wrote:
>>>>
>>>>> Hi Thomas,
>>>>>
>>>>>
>>>>>> Will this include the ability for users to configure logging via
>>>>>> pipeline options?
>>>>>>
>>>>>
>>>>> We're working on a proposal to allow pluggable logging handlers that
>>>>> can be configured via pipeline options.  For example, it would allow you 
>>>>> to
>>>>> add a new logging handler for StackDriver or Elasticsearch.  Will 
>>>>> hopefully
>>>>> have a document to share soon.
>>>>>
>>>>> -chad
>>>>>
>>>>>


Re: Improve container support

2019-11-19 Thread Chad Dombrova
Hi all,
I think it might be good to update the description of the beam docker
images and add some descriptive tags, because searching for "apache beam"
in docker hub does not turn up anything:
https://hub.docker.com/search?q=apache%20beam=image.

I clicked through 10 pages worth and couldn't find it.  Maybe I missed
something, but it clearly shouldn't be this hard.  I did eventually manage
to find it through the docs.

Also, googling "apache beam python docker" does not yield anything
useful.  In fact, it turns up an *unofficial* apache beam docker hub image.

One thing I noticed is that images designated as "Official Images" get
listed first, so it would be good to get that done as well.

thanks!
chad




On Fri, Sep 6, 2019 at 1:21 PM Hannah Jiang  wrote:

> Hi team
>
> I haven't received any objections, so will proceed with settings mentioned
> in a previous email.
>
> A reminder to PMC members, please let me know your docker hub id if you
> want to be an admin.
>
> Thanks,
> Hannah
>
> On Thu, Sep 5, 2019 at 5:02 PM Ankur Goenka  wrote:
>
>> Please ignore the previous email. I was looking at the older document in
>> the mail thread.
>>
>> On Thu, Sep 5, 2019 at 4:58 PM Ankur Goenka  wrote:
>>
>>> I think sdk in the name is obsolete as they are all under sdks name
>>> space.
>>>
>>> On Thu, Sep 5, 2019 at 3:26 PM Hannah Jiang 
>>> wrote:
>>>
 Hi Team

 Thanks for all the comments about beam containers.
 After considering various opinions and investigating gcr and docker
 hub, we decided to push images to docker hub.

 Each image will have two tags, {version}_rc and {version}. {version}
 tag will be added after the release candidate image is verified.
 Meanwhile, we will have a *latest* tag for each repository, which always
 points to the most recent verified release image, so users can pull it by
 default.

 Docker Hub doesn't support nested repository names, which means we should
 follow the *repository:tag* format.
 It's too general if we use {language_version} as the repository for SDK
 images (the version is added once we support multiple versions),
 so I would like to include *sdk* in the repository name. Images generated
 locally will also have the same name.
 Here are some examples:

- python2.7_sdk:2.15.0
- java_sdk:2.15.0_rc
- go_sdk:latest

 I will proceed with this format if there is no strong opposition by
 tomorrow noon (PST).

 *To PMC members*:
 Permission control will follow the pypi model. All interested PMC
 members will be added as admins and release managers will be granted push
 permission.
 Please let me know your *docker id* if you want to be added as an
 admin.

 Thanks,
 Hannah








 On Wed, Sep 4, 2019 at 3:47 PM Thomas Weise  wrote:

> This will greatly simplify trying out portable runners:
> https://beam.apache.org/documentation/runners/flink/#executing-a-beam-pipeline-on-a-flink-cluster
>
> Can't wait for following to disappear from the instructions page: 
> ./gradlew
> :sdks:python:container:docker
>
> On Wed, Sep 4, 2019 at 3:35 PM Thomas Weise  wrote:
>
>> Awesome, thank you!
>>
>>
>> On Wed, Sep 4, 2019 at 3:22 PM Hannah Jiang 
>> wrote:
>>
>>> Hi Thomas
>>>
>>> I created snapshot images from head as of around 2PM today.
>>> You can pull images from
>>> gcr.io/apache-beam-testing/beam/sdks/snapshot.
>>>
>>> Thanks,
>>> Hannah
>>>
>>> On Wed, Sep 4, 2019 at 1:41 PM Thomas Weise  wrote:
>>>
 Hi Hannah,

 Thank you, I know how to build the containers locally, but not how
 to publish them!

 The cwiki says "Publishing images to gcr.io/beam requires
 permissions in apache-beam-testing project."

 Can I get access to the testing project (at least temporarily) and
 what would I need to setup to run the publish target that is shown on 
 cwiki?

 Thanks,
 Thomas


 On Wed, Sep 4, 2019 at 11:06 AM Hannah Jiang <
 hannahji...@google.com> wrote:

> Hi Thomas
>
> I haven't uploaded any snapshot images yet. Here is how you can
> create one from head.
> > cd [...]/beam/
> # For Python
> > ./gradlew :sdks:python:container:py{version}:docker *where
> version is {2,35,36,37}*
> # For Java
> > ./gradlew -p sdks/java/container docker
> # For Go
> > ./gradlew -p sdks/go/container docker
>
> The 2.15 one is just for testing, not a real 2.15.0, nor a
> snapshot from head.
>
> Please let me know if you have any questions.
> Hannah
>
> On Wed, Sep 4, 2019 at 10:57 AM Thomas Weise 
> wrote:
>

Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Chad Dombrova
Hi Thomas,


> Will this include the ability for users to configure logging via pipeline
> options?
>

We're working on a proposal to allow pluggable logging handlers that can be
configured via pipeline options.  For example, it would allow you to add a
new logging handler for StackDriver or Elasticsearch.  Will hopefully have
a document to share soon.

-chad


Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Chad Dombrova
On Wed, Nov 13, 2019 at 10:52 AM Robert Bradshaw 
wrote:

> I would be in favor of using module-level loggers as well.


+1
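
(For concreteness, the per-module pattern under discussion is simply the
following -- the only open question above is the variable name:)

import logging

# One logger per module, named after the module, so that handlers and levels
# can be configured hierarchically (e.g. "apache_beam.io" vs "apache_beam").
_LOGGER = logging.getLogger(__name__)

def do_work():
    _LOGGER.info('doing work in %s', __name__)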


Re: Questions about the current and future design of the job service message stream

2019-11-10 Thread Chad Dombrova
Hi,

> You can see that each JobMessagesResponse may contain a message *or* a
>> GetJobStateResponse.
>>
>> What’s the intention behind this design?
>>
> I believe this was because a user may want to listen to both job state and
> messages all in one stream.
>

Just to be crystal clear, what's the advantage of using a single stream
versus two?

> The reason this is important to me is I’d like to make a handful of
>> changes to GetMessageStream to make it more powerful:
>>
>>- propagate messages from user code (if they opt in to setting up
>>their logger appropriately). currently, AFAICT, the only message the
>>message stream delivers is a final error, if the job fails (other than
>>state changes). It was clearly the original intent of this endpoint to
>>carry other types of messages, and I'd like to bring that to fruition.
>>
>> Log messages are a lot of data; we do have users writing GBs/s when
> aggregated across all their machines in Google Cloud, so I'm not sure if this
> will scale without a lot of control on filtering. Users sometimes don't
> recognize how much they are logging, and if you have 1000 VMs each writing
> only a few lines at a time you can easily saturate this stream.
>

Yes, we're concerned about message volume as well.  The plan would be to
add filters, which could be propagated from the job server to the logger on
the runner and sdk (if they support it) to avoid over-saturating the
stream.  For example, the log-level right now is basically ERROR, so we'd
propagate that to the runner and it would only send error messages back to
the job server.  Thus, we should hopefully be able to roll out this feature
without much change to the end user.  They could then opt-in to higher
volume message levels, if desired.

Some possible filters could be (a rough sketch of such a filter object follows the list):

   - job id (required)
   - log level (default=ERROR)
   - transform id(s) (optional; defaults to just runner messages)
   - a jsonpath selector for filtering on message metadata?
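
As a sketch (every name here is hypothetical -- this is not an existing Beam
API, just an illustration of the filter object described above):

class MessageFilter(object):
    """Hypothetical filter passed from the job server to a logging handler."""

    def __init__(self, job_id, log_level='ERROR', transform_ids=None,
                 metadata_selector=None):
        self.job_id = job_id                        # required
        self.log_level = log_level                  # default log level is ERROR
        self.transform_ids = transform_ids or []    # empty = runner messages only
        self.metadata_selector = metadata_selector  # e.g. a jsonpath expression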

I think a logging implementation would consist of 2 parts:  the logging
service (i.e. an implementation of GetMessageStream) and the logging
handler for emitting messages from the runner and optionally user
transforms.  Handlers would need to be implemented for each SDK (i.e.
language).

The default logging implementation would consist of the InMemoryJobService
on the servicer side, which would send the filter object to the handler.
 The handler would pre-filter messages and stream them back to the standard
job service, which would simply forward on everything it receives, as it
does now.

A StackDriver logging service would be a bit different.  Its logging
handler might send *everything* to StackDriver so that there's a complete
record that can be sifted through later.  Its servicer component would
interpret the filter object into a StackDriver filter string and create a
subscription with StackDriver.

In this way we could support both semi-persistent logging services with a
queryable history (like StackDriver) and completely transient message
streams like we have now.

>
>>- make it possible to back GetMessageStream with logging services
>>like StackDriver, CloudWatch, or Elasticsearch
>>
>> That is interesting, originally the message stream was designed around
> system messages from the runner and not specifically around users log
> messages due to volume concerns. All logging integration to my knowledge
> has been deferred to the client libraries for those specific services.
>

What we're after is a user experience akin to what the Dataflow UI
provides: view a pipeline, open the log console, and view recent messages
from the runner.  click on a transform to view messages emitted by that
transform.  We've found Flink's logging and log UI to be sorely lacking and
we like the idea of tackling this problem at the Beam level, especially
considering so much of what we want is already there in some form.

Another use case that I think would benefit from this is providing custom
progress messages to users who launch a batch job from a shell, since the
message stream is already emitted there.   Of course, you'd have to be
careful about message volume, but as I mentioned there would be 2 levels
where you'd need to opt in:

   - changing log level from its default (ERROR)
   - setting up transform-level logging


-chad


Re: Cython unit test suites running without Cythonized sources

2019-11-10 Thread Chad Dombrova
Hi all,


> The sdist step creates a package that should be installed into each
> tox environment. If the tox environment has cython when this apache
> beam package is installed, it should be used. Nose (or whatever)
> should then run the tests.
>
I spent some time this weekend trying to understand the Beam python build
process, and here’s an overview of what I’ve learned:

   - the :sdks:python:sdist gradle task creates the source tarball (no
   surprises there)
  - the protobuf stubs are generated during this process
   - the sdist is provided to tox, which installs it into the the
   virtualenv for that task
   - for *-cython tasks, tox installs the cython dep and, as Ahmet
   asserted, python setup.py nosetests performs the cythonization.
  - this cythonized build overrides the one installed by tox

Here’s what I learned about the current status of tests wrt cython:

   - cython tox tasks *are* using cython (good!)
   - non-cython tox tasks *are not* using cython (good!)
   - none of the GCP or integration tests are using cython (bad?)
  - This is because the build is only cythonized when python setup.py
  nosetests is used in conjunction with tox (tox installs cython, python
  setup.py nosetests compiles it).
  - GCP tests don't install cython.  ITs don't use tox.

To confirm my understanding of this, I created a PR [1] to assert that a
cythonized or pure-python build is being used.  A cythonized build is
expected by default on linux systems unless a special flag is provided to
inform the test otherwise.  It appears as though the portable tests passed
(i.e. used cython), but I forgot to add the assertion for those; like the
other ITs they are not using cython.

*Questions:*

   - Is the lack of cython for ITs expected and/or desired?
   - Why aren't ITs using tox?  It's quite possible to pass arguments into
   tox to control its behavior.  For example, it seems reasonable that
   run_integration_test.sh could be run from within tox


*Next Steps:* There has been some movement in the python community to solve
problems around build dependencies [2] and toolchains [3].  I hope to have
a proposal for how to simplify this process soon.

[1] https://github.com/apache/beam/pull/10058
[2] https://www.python.org/dev/peps/pep-0517/
[3] https://www.python.org/dev/peps/pep-0518/

-chad


Re: Cython unit test suites running without Cythonized sources

2019-11-07 Thread Chad Dombrova
Hi,
Answers inline below,

It's unclear from the nose source[1] whether it's calling build_py
>> and build_ext, or just build_ext.  It's also unclear whether the result of
>> that build is actually used.  When python setup.py nosetests runs, it runs
>> inside of a virtualenv created by tox, and tox has already installed beam
>> into that venv.  It seems unlikely to me that build_ext or build_py is
>> going to install over top of the beam package installed by tox, but who
>> knows, it may end up first on the path.  Also recall that there is an sdist
>> gradle task running before tox that creates a tarball that is passed to
>> run_tox.sh which passes it along to tox --installpkg flag[2] so that tox
>> won't build beam itself.
>>
>
> I believe the build step is executed as expected and during installation
> results in cythonized package to be installed. This could be verified by,
> in a new virtual environment creating a source distribution, installing
> cython, then installing the source distribution. Resulting installation
> does have the .so files. This is done before running nosetests.
>

Even if it *is* working, I think it's a pretty poor design that we build it
once in sdist and then rebuild it again with nose.  It's very obfuscated
and brittle, hence we're still debating the probable outcome.  We should
choose one place to build and that should either be the sdist gradle task
or tox, not the test command.


> We should designate a single place that always does the build.  I thought
>> that was supposed to be the gradle sdist task, but invoking nose via
>> `python setup.py` means that we're introducing the possibility that a build
>> is occurring which may or may not be used, depending on the entirely
>> unclear dependencies of the setup commands, and the entirely unclear
>> relationship of that build output to the tox venv.  As a first step of
>> clarity, we could stop invoking nose using `python setup.py nosetests`, and
>> instead use `nosetests` (and in the future `pytest`).  I think the reason
>> for `python setup.py nosetest` is to ensure the test requirements are
>> installed,
>>
>
> I believe the reason we are invoking nosetests this way is related to the
> beam testing plugin. It is configured in setup.py. The behavior is
> documented here: https://nose.readthedocs.io/en/latest/api/commands.html
>

It is possible to register a custom plugin without using setup.py:
https://nose.readthedocs.io/en/latest/plugins/writing.html#registering-a-plugin-without-setuptools
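
Following the example in those docs, that would look roughly like the snippet
below (the import path for Beam's plugin is a guess here, not necessarily the
real module name):

import nose

# Hypothetical import path -- wherever setup.py's entry_points currently
# point at the Beam nose plugin.
from test_config import BeamTestPlugin

if __name__ == '__main__':
    # addplugins registers the plugin without any setuptools entry points.
    nose.main(addplugins=[BeamTestPlugin()])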

Since we're on the verge of switching to pytest, perhaps we should
investigate switching that over to not use setup.py instead of chasing
our tails with nose.


> but we could shift those to a separate requirements file or use
>> extra_requires and tox can ensure that they're installed.  I find these two
>> options to be pretty common patterns [3].
>>
>
> We do use extras in tox already. GCP tests work this way by installing
> additional GCP package. In my opinion, letting tox to setup the virtual
> environment either from the package or from setup.py is a better option
> than using requirements file. Otherwise we would need a way to keep
> setup.py and requirements file in sync.
>

Oh yeah, I see that the tests already are an extra package.  Well, that'll
make it that much easier to stop using `python setup.py nosetests`.

-chad


Re: Cython unit test suites running without Cythonized sources

2019-11-07 Thread Chad Dombrova
On Thu, Nov 7, 2019 at 11:31 AM Robert Bradshaw  wrote:

> Does python setup.py nosetests invoke build_ext (or, more generally,
> build)?


It's unclear from the nose source[1] whether it's calling build_py
and build_ext, or just build_ext.  It's also unclear whether the result of
that build is actually used.  When python setup.py nosetests runs, it runs
inside of a virtualenv created by tox, and tox has already installed beam
into that venv.  It seems unlikely to me that build_ext or build_py is
going to install over top of the beam package installed by tox, but who
knows, it may end up first on the path.  Also recall that there is an sdist
gradle task running before tox that creates a tarball that is passed to
run_tox.sh which passes it along to tox --installpkg flag[2] so that tox
won't build beam itself.

We should designate a single place that always does the build.  I thought
that was supposed to be the gradle sdist task, but invoking nose via
`python setup.py` means that we're introducing the possibility that a build
is occurring which may or may not be used, depending on the entirely
unclear dependencies of the setup commands, and the entirely unclear
relationship of that build output to the tox venv.  As a first step of
clarity, we could stop invoking nose using `python setup.py nosetests`, and
instead use `nosetests` (and in the future `pytest`).  I think the reason
for `python setup.py nosetest` is to ensure the test requirements are
installed, but we could shift those to a separate requirements file or use
extra_requires and tox can ensure that they're installed.  I find these two
options to be pretty common patterns [3].

[1] https://github.com/nose-devs/nose/blob/master/nose/commands.py#L113
[2]
https://tox.readthedocs.io/en/latest/config.html#cmdoption-tox-installpkg
[3]
https://stackoverflow.com/questions/29870629/pip-install-test-dependencies-for-tox-from-setup-py



> It's possible cython is present, but the build step is not
> invoked which would explain the skip for slow_coders_test. The correct
> test is being used in
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/fast_coders_test.py#L34
>
> On Thu, Nov 7, 2019 at 11:20 AM Ahmet Altay  wrote:
> >
> > I believe tox is correctly installing cython and executes "python
> setup.py nosetests" which triggers the cythonization path inside setup.py. Some
> indications that cython is installed and used is the following log entries
> (from a recent precommit cron job [1])
> > - [ 1/12] Cythonizing apache_beam/coders/coder_impl.py
> > - Errors with cython.cast in the stack traces.
> > - Tests skipped with: test_using_slow_impl
> (apache_beam.coders.slow_coders_test.SlowCoders) ... SKIP: Found cython,
> cannot test non-compiled implementation.
> >
> > At the same time there are log entries as following:
> > - test_using_fast_impl (apache_beam.coders.fast_coders_test.FastCoders)
> ... SKIP: Cython is not installed
> >
> > It might be an issue with what these tests are using to check whether
> they are cythonized or not. We seem to have at least 2 different versions
> of this check [2][3]. Maybe we need to converge on one (former?).
> >
> > Ahmet
> >
> > [1]
> https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PreCommit_Python_Cron/2008/consoleFull
> > [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/slow_coders_test.py#L32
> > [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/tools/utils.py#L33
> >
> >
> >
> > On Wed, Nov 6, 2019 at 6:19 PM Chad Dombrova  wrote:
> >>
> >> Another potential solution would be to _not_ use the sdist task to
> build the tarball and let tox do it.  Tox should install cython on
> supported platforms before running sdist itself (which it does by default
> unless you explicitly provide it with a tarball, which we are doing).  This
> has the added benefit of one less virtualenv.  Right now we create a
> virtualenv to build the sdist tarball, then we create another virtualenv to
> run tox, then tox creates a virtualenv to run the task.   It's unclear (to
> me) whether the tarball is rebuilt for each tox task or if it's reused.
> >
> >
> > I do not know if not passing the tarball will solve the issue. I tried
> this and ran into the same problem.
> >
> > I agree that we can get rid of setup virtualenv task if it is not adding
> value.
> >
> >>
> >>
> >> -chad
> >>
> >>
> >> On Wed, Nov 6, 2019 at 6:13 PM Udi Meiri  wrote:
> >>>
> >>> I opened this bug today after commenting on Chad's type hints PR.
> >>> https://issues.apache.org/jira/browse/BEAM-8572?filter=-1
> >

Re: Cython unit test suites running without Cythonized sources

2019-11-06 Thread Chad Dombrova
Another potential solution would be to _not_ use the sdist task to build
the tarball and let tox do it.  Tox should install cython on supported
platforms before running sdist itself (which it does by default unless you
explicitly provide it with a tarball, which we are doing).  This has the
added benefit of one less virtualenv.  Right now we create a virtualenv to
build the sdist tarball, then we create another virtualenv to run tox, then
tox creates a virtualenv to run the task.   It's unclear (to me) whether
the tarball is rebuilt for each tox task or if it's reused.

-chad


On Wed, Nov 6, 2019 at 6:13 PM Udi Meiri  wrote:

> I opened this bug today after commenting
>  on
> Chad's type hints PR.
> https://issues.apache.org/jira/browse/BEAM-8572?filter=-1
>
> I am 95% sure that our Precommit tests are using tarballs that are built
> without Cython (including the Cython tasks).
>
> I'm NOT currently working on fixing this. One solution might be to add an
> additional task (sdistCython) and tell gradle that sdist and the new task
> should not run concurrently.
> Does anyone want to take this up?
>


Re: Python Beam pipelines on Flink on Kubernetes

2019-11-01 Thread Chad Dombrova
Hi Thomas,
Do you have an example Dockerfile demonstrating best practices for building
an image that contains both Flink and Beam SDK dependencies?  That would be
useful.

-chad


On Fri, Nov 1, 2019 at 10:18 AM Thomas Weise  wrote:

> For folks looking to run Beam on Flink on k8s, see update in [1]
>
> I also updated [2]
>
> TLDR: at this time best option to run portable pipelines on k8s is to
> create container images that have both Flink and the SDK dependencies.
>
> I'm curious how much interest there is to use the official SDK container
> images and keep Flink and portable pipeline separate as far as the image
> build goes? Deployment can be achieved with the sidecar container approach.
> Most of mechanics are in place already, one addition would be an
> abstraction for where the SDK entry point executes (process, external),
> similar to how we have it for the workers.
>
> Thanks,
> Thomas
>
> [1]
> https://lists.apache.org/thread.html/10aada9c200d4300059d02e4baa47dda0cc2c6fc9432f950194a7f5e@%3Cdev.beam.apache.org%3E
>
> [2]
> https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI
>
> On Wed, Aug 21, 2019 at 9:44 PM Thomas Weise  wrote:
>
>> The changes to containerize the Python SDK worker pool are nearly
>> complete. I also updated the document for next implementation steps.
>>
>> The favored approach (initially targeted option) for pipeline submission
>> is support for the (externally created) fat far. It will keep changes to
>> the operator to a minimum and is applicable for any other Flink job as well.
>>
>> Regarding fine grained resource scheduling, that would happen within the
>> pods scheduled by the k8s operator (or other external orchestration tool)
>> or, further down the road, in a completely elastic/dynamic fashion with
>> active execution mode (where Flink would request resources directly from
>> k8s, similar to how it would work on YARN).
>>
>>
>> On Tue, Aug 13, 2019 at 10:59 AM Chad Dombrova  wrote:
>>
>>> Hi Thomas,
>>> Nice work!  It's really clearly presented.
>>>
>>> What's the current favored approach for pipeline submission?
>>>
>>> I'm also interested to know how this plan overlaps (if at all) with the
>>> work on Fine-Grained Resource Scheduling [1][2] that's being done for Flink
>>> 1.9+, which has implications for creation of task managers in kubernetes.
>>>
>>> [1]
>>> https://docs.google.com/document/d/1h68XOG-EyOFfcomd2N7usHK1X429pJSMiwZwAXCwx1k/edit#heading=h.72szmms7ufza
>>> [2] https://issues.apache.org/jira/browse/FLINK-12761
>>>
>>> -chad
>>>
>>>
>>> On Tue, Aug 13, 2019 at 9:14 AM Thomas Weise  wrote:
>>>
>>>> There have been a few comments in the doc and I'm going to start
>>>> working on the worker execution part.
>>>>
>>>> Instead of running a Docker container for each worker, the preferred
>>>> option is to use the worker pool:
>>>>
>>>>
>>>> https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.bzs6y6ms0898
>>>>
>>>> There are some notes at the end of the doc regarding the
>>>> implementation. It would be great if those interested in the environments
>>>> and Python docker container could take a look. In a nutshell, the proposal
>>>> is to make few changes to the Python container so that it (optionally) can
>>>> be used to run the worker pool.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Wed, Jul 24, 2019 at 9:42 PM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> I am very happy to see this. I'll take a look, and leave my comments.
>>>>>
>>>>> I think this is something we'd been needing, and it's great that you
>>>>> guys are putting thought into it. Thanks!<3
>>>>>
>>>>> On Wed, Jul 24, 2019 at 9:01 PM Thomas Weise  wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently Lyft open sourced *FlinkK8sOperator,* a Kubernetes operator
>>>>>> to manage Flink deployments on Kubernetes:
>>>>>>
>>>>>> https://github.com/lyft/flinkk8soperator/
>>>>>>
>>>>>> We are now discussing how to extend this operator to also support
>>>>>> deployment of Python Beam pipelines with the Flink runner. I would like 
>>>>>> to
>>>>>> share the proposal with the Beam community to enlist feedback as well as
>>>>>> explore opportunities for collaboration:
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/
>>>>>>
>>>>>> Looking forward to your comments and suggestions!
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>


Re: RFC: python static typing PR

2019-10-30 Thread Chad Dombrova
> Do you believe that a future mypy plugin could replace pipeline type
> checks in Beam, or are there limits to what it can do?
>

mypy will get us quite far on its own once we completely annotate the beam
code.  That said, my PR does not include my efforts to turn PTransforms
into Generics, which will be required to properly analyze pipelines, so
there's still a lot more work to do.  I've experimented with a mypy plugin
to smooth over some of the rough spots in that workflow and I will just say
that the mypy API has a very steep learning curve.

Another thing to note: mypy is very explicit about function annotations.
It does not do the "implicit" inference that Beam does, such as
automatically detecting function return types.  I *think* it should be
possible to do a lot of that as a mypy plugin, and in fact, since it has
little to do with Beam it could grow into its own project with outside
contributors.
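
A tiny illustration of that difference (this is just stock mypy behavior,
nothing Beam-specific):

def annotated(x: int) -> str:
    return str(x)    # mypy checks this body against the declared types

def unannotated(x):
    return str(x)    # with no annotations, mypy skips checking the body by default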

-chad


Re: RFC: python static typing PR

2019-10-30 Thread Chad Dombrova
>
> As Beam devs will be gaining more first-hand experience with the tooling,
> we may need to add a style guide/best practices/FAQ to our contributor
> guide to clarify known issues.
>

I'm happy to help out with that, just let me know.

-chad


Re: Quota issues again

2019-10-29 Thread Chad Dombrova
> +1 for splitting pre-commit tests into smaller modules. However in this
> case we need to run all the small tests periodically and have some combined
> flag or dashboard for regular monitoring. Otherwise we might not run/check
> a large portion of the tests.
>

post-commit seems like the best place for that, no?


Re: (mini-doc) Beam (Flink) portable job templates

2019-10-28 Thread Chad Dombrova
Thanks for the follow up, Thomas.

On Mon, Oct 28, 2019 at 7:55 PM Thomas Weise  wrote:

> Follow-up for users looking to run portable pipelines on Flink:
>
> After prototyping the generate-jar-file approach for internal deployment
> and some related discussion, the conclusion was that it is too limiting.
> The sticky point is that the jar file would need to be generated at
> container build time. That does not allow us to execute any logic in the
> Python driver program that depends on the deploy environment, such as
> retrieval of environment variables for configuration/credentials, setting a
> submission timestamp for stream positioning etc.
>
> What worked well was that no job server was required to submit the Flink
> job and the jar file could be used with the existing Flink tooling; there
> was no need to change the FlinkK8sOperator
>  at all.
>
> I then looked for a way to eliminate the build time translation and
> execute the Python driver program when the job is submitted, but still as a
> Flink entry point w/o extra job server deployment and client side
> dependencies. How can that work?
>
> https://issues.apache.org/jira/browse/BEAM-8471
>
> The main point was that there should be no requirement to install things
> on the client. FlinkK8sOperator is talking to the Flink REST API, w/o
> Python or Java. The Python dependencies need to be present on the Flink job
> manager host at the time the job is started through the REST API. That was
> something we had already solved for our container image build, and from
> conversation with few other folks this was their preferred container build
> approach also.
>
> In the future we may seek the ability to separate Flink and
> SDK/application bits into different images. For the SDK worker, this is
> intended via the external environment and sidecar container. For the client
> driver program, a similar approach could be implemented. Through an
> "external client environment", instead of a local process execution.
>
> The new Flink runner can be used as entry point for the REST API, the
> Flink CLI or standalone, especially for Flink centric automation. Of course
> portable pipelines can also be directly submitted through the SDK language
> client, via job server or other tooling, like the Python Flink client that
> Robert contributed recently.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 22, 2019 at 12:58 PM Kyle Weaver  wrote:
>
>> Following up on discussion in this morning's OSS runners meeting, I have
>> uploaded a draft PR for the full implementation (job creation + execution):
>> https://github.com/apache/beam/pull/9408
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw 
>> wrote:
>>
>>> The point of expansion services is to run at pipeline construction
>>> time so that the caller can build on top of the outputs. E.g. we're
>>> hoping to expose Beam's SQL transforms to other languages via an
>>> expansion service and *not* duplicate the logic of parsing the SQL
>>> statements to determine the type(s) of the outputs. Even for simpler
>>> IOs, we would like to take advantage of schema information (e.g.
>>> looked up at construction time) to produce results and validate (or
>>> even inform) subsequent construction.
>>>
>>> I think we're also making a mistake in talking about "the" expansion
>>> service here, as if there was only one well defined service that all
>>> pipenes used. If we go the route of deferring some expansion to the
>>> runner, we need a way of naming expansion services. It seems like this
>>> proposal is simply isomorphic to defining new primitive transforms
>>> which some (all?) runners are just expected to understand.
>>>
>>> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise  wrote:
>>> >
>>> >
>>> >
>>> > On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik  wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay  wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise 
>>> wrote:
>>> 
>>>  There is a PR open for this:
>>> https://github.com/apache/beam/pull/9331
>>> 
>>>  (it wasn't tagged with the JIRA and therefore not linked)
>>> 
>>>  I think it is worthwhile to explore how we could further detangle
>>> the client side Python and Java dependencies.
>>> 
>>>  The expansion service is one more dependency to consider in a build
>>> environment. Is it really necessary to expand external transforms prior to
>>> submission to the job service?
>>> >>>
>>> >>>
>>> >>> +1, this will make it easier to use external transforms from the
>>> already familiar client environments.
>>> >>>
>>> >>
>>> >>
>>> >> The intent is to make it so that you CAN (not MUST) run an expansion
>>> service separate from a Runner. Creating a single endpoint that hosts both
>>> the Job and Expansion service is something that gRPC does very easily since
>>> you can host 

Re: RFC: python static typing PR

2019-10-28 Thread Chad Dombrova
> Wow, that is an incredible amount of work!
>

Some people meditate.  I annotate ;)

I'm definitely of the opinion that there's no viable counterargument to the
> value of types, especially for large or complex codebases.
>

Agreed.  That's part of why I waited until I got the whole thing passing
before really promoting the review of this PR.

Robert and I have worked out a rough plan for merging:

   - I'll make a new PR with the foundations of the existing PR -- these
   are almost entirely type comments (see the short example after this list),
   so no appreciable runtime changes -- along with the mypy lint, which will
   be part of the python precommit but not affect its failure status.  We'll
   get that merged first.
   - From there it should be easy for others to review and merge the
   remaining commits in parallel.
   - Once everything is in, we'll make a final commit to stop ignoring
   failures for the mypy job.
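
For reference, the type-comment style mentioned in the first bullet looks like
this (semantically identical to inline annotations, but valid under Python 2):

from typing import List

def split_words(line, delimiter=' '):
    # type: (str, str) -> List[str]
    return line.split(delimiter)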


For those concerned about yet another lint (a very reasonable concern!) the
py37-mypy tox job completes in 37s on my macbook pro (excluding virtualenv
setup time).

-chad


Re: Quota issues again

2019-10-28 Thread Chad Dombrova
Can we get more aggressive about separating tests into groups by those that
are dependent on other languages and those that are not?  I think we could
dramatically reduce our backlog if we didn’t run all of the Java tests
every time a commit is made that only affects python code, and vice versa.

-chad


On Mon, Oct 28, 2019 at 3:05 PM Mikhail Gryzykhin  wrote:

> Quota jira issue:
> https://issues.apache.org/jira/browse/BEAM-8195
>
> On Mon, Oct 28, 2019 at 2:05 PM Mikhail Gryzykhin 
> wrote:
>
>> Hi everyone,
>>
>>
>> While validating the release branch, I got a failure due to quota again. Also, 
>> current queue time for jobs is more than 1.5 hours.
>>
>>
>> I'm not sure if it is worth starting another thread on tests efficiency, but 
>> still want to keep this mail to highlight the issues.
>>
>>
>> See PS for links.
>>
>>
>> Regards,
>>
>> --Mikhail
>>
>>
>> PS:
>>
>> https://builds.apache.org/job/beam_PostCommit_Go_PR/71/consoleFull
>>
>> *13:46:25* 2019/10/28 20:46:25 Test wordcount:kinglear failed: googleapi: 
>> Error 429: Quota exceeded for quota metric 
>> 'dataflow.googleapis.com/create_requests' and limit 
>> 'CreateRequestsPerMinutePerUser' of service 'dataflow.googleapis.com' for 
>> consumer 'project_number:844138762903'., rateLimitExceeded
>>
>>
>> Queue time:
>>
>> http://metrics.beam.apache.org/d/_TNndF2iz/pre-commit-test-latency?orgId=1
>>
>>


RFC: python static typing PR

2019-10-28 Thread Chad Dombrova
Hi all,
I've been working on a PR to add static typing to the beam python sdk for
the past 4 months or so.  This has been an epic journey which has required
chasing down numerous fixes across several other projects (mypy, pylint,
python-future), but the mypy tests are now passing!

I'm not sure how much convincing I need to do with this group on the
benefits of static typing, especially considering the heavy Java influence
on this project, but for those who are curious, there's some good info
here:  https://realpython.com/python-type-checking/#pros-and-cons.  Suffice
it to say, it's a game changer for projects like Beam (large code base,
high level of quality, good testing infrastructure), and it dovetails
perfectly into the efforts surrounding pipeline type analysis and
serialization.  Anecdotally speaking, all of the developers I've worked
with who were originally resistant to static typing in python ("it's
ugly!"  "it's not pythonic!") have changed their tune and fully embraced it
because of the time that it saves them every day.

More details over at the PR:  https://github.com/apache/beam/pull/9056

I look forward to your feedback!

-chad


Re: Install Jenkins AnsiColor plugin

2019-10-22 Thread Chad Dombrova
thanks, so IIUC, I’m going to update job_00_seed.groovy like this:

  wrappers {
colorizeOutput()
timeout {
  absolute(60)
  abortBuild()
}
  }

Then add the comment run seed job

Does anyone have any concerns with me trying this out now?

-chad

On Tue, Oct 22, 2019 at 11:42 AM Udi Meiri  wrote:

> Also note that changing the job DSL doesn't take effect until the "seed"
> job runs. (use the "run seed job" phrase)
>
> On Tue, Oct 22, 2019 at 11:06 AM Chad Dombrova  wrote:
>
>> Thanks, I'll look into this.  I have a PR I'm building up with a handful
>> of minor changes related to this.
>>
>>
>>
>> On Tue, Oct 22, 2019 at 10:45 AM Yifan Zou  wrote:
>>
>>> Thanks, Udi! The ansicolor plugin was applied to ASF Jenkins
>>> universally. You might need to explicitly enable the coloroutput in your
>>> jenkins dsl.
>>>
>>> On Tue, Oct 22, 2019 at 10:33 AM Udi Meiri  wrote:
>>>
>>>> Seems to be already installed:
>>>> https://issues.apache.org/jira/browse/INFRA-16944
>>>> Do we just need to enable it somehow?
>>>> This might work:
>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.helpers.wrapper.WrapperContext.colorizeOutput
>>>>
>>>> BTW, our Jenkins is maintained by ASF's Infrastructure team:
>>>> https://cwiki.apache.org/confluence/display/INFRA/Jenkins
>>>>
>>>> On Tue, Oct 22, 2019 at 10:23 AM Chad Dombrova 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> As a user trying to grok failures in jenkins I think it would be a
>>>>> huge help to have color output support.  This is something that works out
>>>>> of the box for CI tools like gitlab and travis, and it really helps bring
>>>>> that 21st century feel to your logs :)
>>>>>
>>>>> There's a Jenkins plugin for colorizing ansi escape sequences here:
>>>>> https://plugins.jenkins.io/ansicolor
>>>>>
>>>>> I think this is something that has to be deployed by a Jenkins admin.
>>>>>
>>>>> -chad
>>>>>
>>>>>


Re: Install Jenkins AnsiColor plugin

2019-10-22 Thread Chad Dombrova
Thanks, I'll look into this.  I have a PR I'm building up with a handful of
minor changes related to this.



On Tue, Oct 22, 2019 at 10:45 AM Yifan Zou  wrote:

> Thanks, Udi! The ansicolor plugin was applied to ASF Jenkins universally.
> You might need to explicitly enable the color output in your jenkins dsl.
>
> On Tue, Oct 22, 2019 at 10:33 AM Udi Meiri  wrote:
>
>> Seems to be already installed:
>> https://issues.apache.org/jira/browse/INFRA-16944
>> Do we just need to enable it somehow?
>> This might work:
>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.helpers.wrapper.WrapperContext.colorizeOutput
>>
>> BTW, our Jenkins is maintained by ASF's Infrastructure team:
>> https://cwiki.apache.org/confluence/display/INFRA/Jenkins
>>
>> On Tue, Oct 22, 2019 at 10:23 AM Chad Dombrova  wrote:
>>
>>> Hi all,
>>> As a user trying to grok failures in jenkins I think it would be a huge
>>> help to have color output support.  This is something that works out of the
>>> box for CI tools like gitlab and travis, and it really helps bring that
>>> 21st century feel to your logs :)
>>>
>>> There's a Jenkins plugin for colorizing ansi escape sequences here:
>>> https://plugins.jenkins.io/ansicolor
>>>
>>> I think this is something that has to be deployed by a Jenkins admin.
>>>
>>> -chad
>>>
>>>


Install Jenkins AnsiColor plugin

2019-10-22 Thread Chad Dombrova
Hi all,
As a user trying to grok failures in jenkins I think it would be a huge
help to have color output support.  This is something that works out of the
box for CI tools like gitlab and travis, and it really helps bring that
21st century feel to your logs :)

There's a Jenkins plugin for colorizing ansi escape sequences here:
https://plugins.jenkins.io/ansicolor

I think this is something that has to be deployed by a Jenkins admin.

-chad


Re: Test failures in python precommit: ZipFileArtifactServiceTest

2019-10-21 Thread Chad Dombrova
thanks again!

On Mon, Oct 21, 2019 at 1:03 PM Robert Bradshaw  wrote:

> I just merged https://github.com/apache/beam/pull/9845 which should
> resolve the issue.
>
> On Mon, Oct 21, 2019 at 12:58 PM Chad Dombrova  wrote:
> >
> > thanks!
> >
> > On Mon, Oct 21, 2019 at 12:47 PM Kyle Weaver 
> wrote:
> >>
> >> This issue is being tracked at
> https://issues.apache.org/jira/browse/BEAM-8416.
> >>
> >> On Mon, Oct 21, 2019 at 9:42 PM Chad Dombrova 
> wrote:
> >>>
> >>> Hi all,
> >>> Is anyone else getting these errors in
> apache_beam.runners.portability.artifact_service_test.ZipFileArtifactServiceTest?
> >>>
> >>> They seem to be taking two forms:
> >>>
> >>> zipfile.BadZipFile: Bad CRC-32 for file
> '/3e3ff9aa4fe679c1bf76383e69bfb5e2167afb945aa30e15f05406cc8f55ad14/9367417d63903350aeb7e092bca792263d4fd82d4912252e014e073a8931b4c1'
> >>>
> >>> zipfile.BadZipFile: Bad magic number for file header
> >>>
> >>> Here are some gradle scans:
> >>>
> >>>
> https://scans.gradle.com/s/b7jd7oyu5f5f6/console-log?task=:sdks:python:test-suites:tox:py37:testPy37Cython#L14473
> >>>
> >>>
> https://scans.gradle.com/s/4iega3kyf5kw2/console-log?task=:sdks:python:test-suites:tox:py37:testPython37#L13749
> >>>
> >>> I got it to go through eventually after 4 tries.
> >>>
> >>> -chad
>


Re: Test failures in python precommit: ZipFileArtifactServiceTest

2019-10-21 Thread Chad Dombrova
thanks!

On Mon, Oct 21, 2019 at 12:47 PM Kyle Weaver  wrote:

> This issue is being tracked at
> https://issues.apache.org/jira/browse/BEAM-8416.
>
> On Mon, Oct 21, 2019 at 9:42 PM Chad Dombrova  wrote:
>
>> Hi all,
>> Is anyone else getting these errors in
>> apache_beam.runners.portability.artifact_service_test.ZipFileArtifactServiceTest?
>>
>> They seem to be taking two forms:
>>
>> zipfile.BadZipFile: Bad CRC-32 for file 
>> '/3e3ff9aa4fe679c1bf76383e69bfb5e2167afb945aa30e15f05406cc8f55ad14/9367417d63903350aeb7e092bca792263d4fd82d4912252e014e073a8931b4c1'
>>
>> zipfile.BadZipFile: Bad magic number for file header
>>
>> Here are some gradle scans:
>>
>>
>> https://scans.gradle.com/s/b7jd7oyu5f5f6/console-log?task=:sdks:python:test-suites:tox:py37:testPy37Cython#L14473
>>
>>
>> https://scans.gradle.com/s/4iega3kyf5kw2/console-log?task=:sdks:python:test-suites:tox:py37:testPython37#L13749
>>
>> I got it to go through eventually after 4 tries.
>>
>> -chad
>>
>


Test failures in python precommit: ZipFileArtifactServiceTest

2019-10-21 Thread Chad Dombrova
Hi all,
Is anyone else getting these errors in
apache_beam.runners.portability.artifact_service_test.ZipFileArtifactServiceTest?

They seem to be taking two forms:

zipfile.BadZipFile: Bad CRC-32 for file
'/3e3ff9aa4fe679c1bf76383e69bfb5e2167afb945aa30e15f05406cc8f55ad14/9367417d63903350aeb7e092bca792263d4fd82d4912252e014e073a8931b4c1'

zipfile.BadZipFile: Bad magic number for file header

Here are some gradle scans:

https://scans.gradle.com/s/b7jd7oyu5f5f6/console-log?task=:sdks:python:test-suites:tox:py37:testPy37Cython#L14473

https://scans.gradle.com/s/4iega3kyf5kw2/console-log?task=:sdks:python:test-suites:tox:py37:testPython37#L13749

I got it to go through eventually after 4 tries.

-chad


Beam on Flink with "active" k8s support

2019-10-19 Thread Chad Dombrova
Hi all,
I've been following the Jira issue on Flink "active" k8s support [1]
(autoscaling based on task resource requirements, IIUC) and there has been
a lot of activity there lately.  There are two design docs [2][3] from
different teams and it seems like some good collaboration is going on to
reconcile the differences and get the feature implemented.

At the Beam Summit I saw the great presentations by Thomas and Micah on the
work that Lyft has been doing to run Flink and Beam on k8s.  I'm curious if
these various features and approaches can be made to work with each other,
and if so, what it would take to do so.

thanks,
-chad

[1] https://issues.apache.org/jira/browse/FLINK-9953
[2]
https://docs.google.com/document/d/1-jNzqGF6NfZuwVaFICoFQ5HFFXzF5NVIagUZByFMfBY/edit
[3]
https://docs.google.com/document/d/1Or1hmDXABk1Q3ToITMi0HoZyHLs2nZ6gk7E0y2sEe28/edit#heading=h.ubqno6n4vwrl


Re: RFC: Assigning environments to transforms in a pipeline

2019-10-16 Thread Chad Dombrova
Hi Robert,

> Sounds nice. Is there a design doc (or, perhaps, you could just give an
> example of what this would look like in this thread)?
>

I'll follow up shortly with something.  The good news is that this first PR
is quite straightforward and (I think) is independent of the semantics of
how an Environment will ultimately be used.  This PR just answers the
questions: "how do we represent an Environment outside of the portability
framework, and how do we convert that to and from the portability
representation?".  We already have very well established patterns for these
questions.

-chad


RFC: Assigning environments to transforms in a pipeline

2019-10-16 Thread Chad Dombrova
Hi all,
One of our goals for the portability framework is to be able to assign
different environments to different segments of a pipeline.  This is not
possible right now because environments are a concept that really only
exists in the portable runner as protobuf messages: they lack a proper API
on the pipeline definition side of things.

As a first step toward our goal, one of our team members just created a
PR[1] exposing environments as a proper class hierarchy, akin to
PTransforms, PCollections, Coders, etc. It's quite straightforward, and we
were careful to adhere to existing patterns for similar types, so hopefully
the end result feels natural.  After this PR is merged, our next step will
be to create a proposal for assigning environments to transforms.
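
For anyone who wants a mental model before reading the PR, here is a
self-contained sketch (these are not Beam's actual classes, just an
illustration of the idea) of what "environments as a proper class hierarchy"
means -- plain objects that know how to round-trip to the portable
representation:

class Environment(object):
    def to_runner_api(self):
        raise NotImplementedError

class DockerEnvironment(Environment):
    def __init__(self, container_image):
        self.container_image = container_image

    def to_runner_api(self):
        # in the real SDK this would build an Environment proto
        return {'urn': 'beam:env:docker:v1', 'payload': self.container_image}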

Let us know what you think!

-chad

[1] https://github.com/apache/beam/pull/9811


RFC: switching to pylint 2.4

2019-10-10 Thread Chad Dombrova
Hi all,
I made a PR to update to the latest version of pylint here:
https://github.com/apache/beam/pull/9725

I think it's ready to go, but would love to get a few eyes on it.

-chad


Re: [DISCUSS] JIRA open/close emails?

2019-10-10 Thread Chad Dombrova
On Thu, Oct 10, 2019 at 10:56 AM Maximilian Michels  wrote:

> Our mailing list already has decent traffic. The current solution is
> better for its readability. People would simply adapt to more emails by
> creating a filter or ignoring them.



+1. I find the dev list generally informative, but as a part-time external
contributor, open/close emails would detract from its usefulness for me.

As an alternative to a dedicated mailing list for this purpose, we could
create a wiki article on how to setup customized email notifications from
Jira, with an example specifically for open/close events.  That’s something
I might actually use, but only with additional filters like “sdk-python”.

-chad



>
> I know that the triaging of issues can be slow. How about a dedicated
> mailing list for only open/close notifications, and a weekly / monthly
> summary of untriaged issues to the dev mailing list?
>
> Thanks,
> Max
>
> On 09.10.19 21:58, Thomas Weise wrote:
> > Depends on JIRA volume, also. There are projects where the dev@ is
> > heavily populated with create JIRA notifications (and other traffic goes
> > under a bit).
> >
> > I'm generally in favor of making the creation of JIRAs more visible. But
> > if there isn't broad enough interest these notifications can still be
> > surfaced individually by filtering.
> >
> > With "close" you probably mean resolved? That's kind of nice to know
> > when something is complete, but I'm unsure if it should go to dev@,
> > because create provided the opportunity for interested parties to watch
> > the JIRA they care about.
> >
> >
> > On Wed, Oct 9, 2019 at 8:58 PM Manu Zhang  > > wrote:
> >
> > +1. Like the subscribe to only "close issue" action instead of each
> > comment on GitHub
> >
> > Manu
> >
> > On Thu, Oct 10, 2019 at 10:51 AM Kenneth Knowles  > > wrote:
> >
> > Currently, all email from JIRA goes to iss...@beam.apache.org
> > .
> >
> > I just learned that HBase has a JIRA configuration so that issue
> > open/close goes to dev@ and every other notification goes to
> > issues@. That actually seems nice for involving all of the
> > community in new issue triage and also closed issue
> > verification/notification.
> >
> > What do you think about such a system?
> >
> > Kenn
> >
>


Re: NOTICE: New Python PreCommit jobs

2019-10-07 Thread Chad Dombrova
There's a lot of value to switching to pytest even without xdist.  Could we
prune back the goals of this first PR to just achieving feature parity with
nose, and make a followup PR for xdist?

-chad

On Mon, Oct 7, 2019 at 12:04 PM Udi Meiri  wrote:

>
>
> On Fri, Oct 4, 2019 at 10:35 AM Chad Dombrova  wrote:
>
>>
>> I have a WiP PR to convert Beam to use pytest, but it's been stalled.
>>>
>>
>> What would it take to get it back on track?
>>
>
> Besides needing to convert ITs (removing save_main_session), which can be
> split out to a later PR, there's verifying that the same set of tests are
> collected for each suite.
>
>
>>
>>
>>> Another nice thing about pytest is that you'll be able to tell which
>>> suite a test belongs to.
>>>
>>
>> pytest has a lot of quality of life improvements over nose.  The biggest
>> and simplest one is that the test name that it prints is in the same format
>> as the runner expects for specifying individual tests to run, so you can
>> just copy and paste on the command line to run that one test.  Genius.
>> Also, since it uses directory names for tests and not module names, you can
>> tab complete.   The whole fixture
>>
> LOL at the copy-paste issue.
>
>
>> concept is also great, since it gives you a new axis for test
>> composability and reuse, instead of just complex sub-classing or
>> copy-and-paste.   After switching to pytest we went through our tests and
>> replaced all of our horrible test mixins with fixtures and the end result
>> is much more legible and maintainable.  There's honestly nothing I miss
>> about nose.
>>
>> -chad
>>
>>
>>
>>
>>


Re: Why is there no standard boolean coder?

2019-10-06 Thread Chad Dombrova
PR is ready for review:  https://github.com/apache/beam/pull/9735



On Mon, Sep 30, 2019 at 12:03 PM Maximilian Michels  wrote:

> +1
>
> On 29.09.19 13:44, Chad Dombrova wrote:
> > I’m planning on porting the existing Java coder to Python. Any
> > objections to that?
> >
> > -chad
> >
> >
> > On Sun, Sep 29, 2019 at 1:02 PM Robert Burke  > <mailto:rob...@frantil.com>> wrote:
> >
> > +1
> >
> > I'm happy to whip together the Go SDK version once the encoding has
> > been concretely decided.
> >
> > On Fri, Sep 27, 2019, 6:07 PM Chad Dombrova  > <mailto:chad...@gmail.com>> wrote:
> >
> >
> > It would still be a standard coder - the distinction I'm
> > proposing is that there are certain coders that _must_ be
> > implemented by a new runner/sdk (for example windowedvalue,
> > varint, kv, ...) since they are important for SDK - runner
> > communication, but now we're starting to standardize coders
> > that are useful for cross-language and schemas.
> >
> > Got it.  Sounds good.
> >
> > -chad
> >
> >
>


Re: Beam Summit Videos in youtube

2019-10-05 Thread Chad Dombrova
Hi Max,
Any progress on this?  There are a few talks I really want to review!

-chad


On Wed, Sep 18, 2019 at 12:56 PM Maximilian Michels  wrote:

> Hi Rahul,
>
> The Beam Summit committee is working on this at the moment. Stay tuned.
>
> Thanks,
> Max
>
> On 18.09.19 11:39, rahul patwari wrote:
> > Hi,
> >
> > The videos of Beam Summit that has happened recently have disappeared
> > from YouTube Apache Beam channel.
> >
> > Is uploading the videos a WIP?
> >
> > Thanks,
> > Rahul
> >
> >
>


Re: NOTICE: New Python PreCommit jobs

2019-10-04 Thread Chad Dombrova
> I have a WiP PR to convert Beam to use pytest, but it's been stalled.
>

What would it take to get it back on track?


> Another nice thing about pytest is that you'll be able to tell which suite
> a test belongs to.
>

pytest has a lot of quality of life improvements over nose.  The biggest
and simplest one is that the test name that it prints is in the same format
as the runner expects for specifying individual tests to run, so you can
just copy and paste on the command line to run that one test.  Genius.
Also, since it uses directory names for tests and not module names, you can
tab complete.   The whole fixture concept is also great, since it gives you
a new axis for test composability and reuse, instead of just complex
sub-classing or copy-and-paste.   After switching to pytest we went through
our tests and replaced all of our horrible test mixins with fixtures and
the end result is much more legible and maintainable.  There's honestly
nothing I miss about nose.
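
For anyone who hasn't used fixtures, here's a minimal, generic example (not
from the Beam test suite) of the composability I mean -- any test opts in to
a fixture simply by naming it as an argument:

import pytest

@pytest.fixture
def pipeline_options():
    # shared setup that would otherwise live in a mixin or base class
    return {'runner': 'DirectRunner', 'streaming': False}

def test_runner_default(pipeline_options):
    assert pipeline_options['runner'] == 'DirectRunner'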

-chad


Re: NOTICE: New Python PreCommit jobs

2019-10-02 Thread Chad Dombrova
Hi all,
I've posted a new PR that just splits out the python lint job here:
https://github.com/apache/beam/pull/9706

I'll be running the seed job shortly unless anyone objects.

-chad


On Tue, Oct 1, 2019 at 9:04 PM Chad Dombrova  wrote:

> I haven’t used nose’s parallel execution plugin, but I have used pytest
> with xdist with success. If your tests are designed to run in any order and
> are properly sandboxed to prevent crosstalk between concurrent runs, which
> they *should* be, then in my experience it works very well.
>
>
> On Fri, Sep 27, 2019 at 6:51 PM Kenneth Knowles  wrote:
>
>> Do things go wrong when nose is configured to use parallel execution?
>>
>> On Fri, Sep 27, 2019 at 5:09 PM Chad Dombrova  wrote:
>>
>>> By the way, the outcome on this was that splitting the python precommit
>>> job into one job per python version resulted in increasing the total test
>>> completion time by 66%, which is obviously not good.  This is because we
>>> are using Gradle to run the python tests tasks in parallel (the jenkins VMs
>>> have 16 cores each, utilized across 2 slots, IIRC), but after the split
>>> there were only 1-2 gradle tasks per test.  Since the python test runner,
>>> nose, is currently not using parallel execution, there were not enough
>>> concurrent tasks to make proper use of the VM's CPUs.
>>>
>>> tl;dr  I'm going to create a followup PR to split out just the Lint job
>>> (same as we have Spotless for Java).   This is our best ROI for now.
>>>
>>> -chad
>>>
>>>
>>> On Fri, Sep 27, 2019 at 3:27 PM Kyle Weaver  wrote:
>>>
>>>> > Do we have good pypi caching?
>>>>
>>>> Building Python SDK harness containers takes 2 mins each (times 4, the
>>>> number of versions) on my machine, even if nothing has changed. But we're
>>>> already paying that cost, so I don't think splitting the jobs should make
>>>> it any worse. (https://issues.apache.org/jira/browse/BEAM-8277 if
>>>> anyone has any ideas)
>>>>
>>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>> kcwea...@google.com
>>>>
>>>>
>>>> On Wed, Sep 25, 2019 at 11:21 AM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> Thanks Chad, and thank you for notifying on the dev list.
>>>>>
>>>>> On Wed, Sep 25, 2019 at 10:59 AM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> Nice.
>>>>>>
>>>>>> Do we have good pypi caching? If not this could add a lot of overhead
>>>>>> to our already-backed-up CI queue. (btw I still think your change is 
>>>>>> good,
>>>>>> and just makes proper caching more important)
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 9:55 PM Chad Dombrova 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I'm working to make the CI experience with python a bit better, and
>>>>>>> my current initiative is splitting up the giant Python PreCommit job into 5
>>>>>>> separate jobs for Lint, Py2, Py3.5, Py3.6, and Py3.7.
>>>>>>>
>>>>>>> Around 11am Pacific time tomorrow I'm going to initiate the seed
>>>>>>> jobs, at which point all PRs will start to run the new precommit jobs.
>>>>>>> It's a bit of a chicken-and-egg scenario with testing this, so there 
>>>>>>> could
>>>>>>> be issues that pop up after the seed jobs are created, but I'll be 
>>>>>>> working
>>>>>>> to resolve those issues as quickly as possible.
>>>>>>>
>>>>>>> If you run into problems because of this change, please let me know
>>>>>>> on the github PR.
>>>>>>>
>>>>>>> Here's the PR: https://github.com/apache/beam/pull/9642
>>>>>>> Here's the Jira: https://issues.apache.org/jira/browse/BEAM-8213#
>>>>>>>
>>>>>>> The upshot is that after this is done you'll get better feedback on
>>>>>>> python test failures!
>>>>>>>
>>>>>>> Let me know if you have any concerns.
>>>>>>>
>>>>>>> thanks,
>>>>>>> chad
>>>>>>>
>>>>>>>


Re: NOTICE: New Python PreCommit jobs

2019-10-01 Thread Chad Dombrova
I haven’t used nose’s parallel execution plugin, but I have used pytest
with xdist with success. If your tests are designed to run in any order and
are properly sandboxed to prevent crosstalk between concurrent runs, which
they *should* be, then in my experience it works very well.


On Fri, Sep 27, 2019 at 6:51 PM Kenneth Knowles  wrote:

> Do things go wrong when nose is configured to use parallel execution?
>
> On Fri, Sep 27, 2019 at 5:09 PM Chad Dombrova  wrote:
>
>> By the way, the outcome on this was that splitting the python precommit
>> job into one job per python version resulted in increasing the total test
>> completion time by 66%, which is obviously not good.  This is because we
>> are using Gradle to run the python tests tasks in parallel (the jenkins VMs
>> have 16 cores each, utilized across 2 slots, IIRC), but after the split
>> there were only 1-2 gradle tasks per test.  Since the python test runner,
>> nose, is currently not using parallel execution, there were not enough
>> concurrent tasks to make proper use of the VM's CPUs.
>>
>> tl;dr  I'm going to create a followup PR to split out just the Lint job
>> (same as we have Spotless for Java).   This is our best ROI for now.
>>
>> -chad
>>
>>
>> On Fri, Sep 27, 2019 at 3:27 PM Kyle Weaver  wrote:
>>
>>> > Do we have good pypi caching?
>>>
>>> Building Python SDK harness containers takes 2 mins each (times 4, the
>>> number of versions) on my machine, even if nothing has changed. But we're
>>> already paying that cost, so I don't think splitting the jobs should make
>>> it any worse. (https://issues.apache.org/jira/browse/BEAM-8277 if
>>> anyone has any ideas)
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>>
>>>
>>> On Wed, Sep 25, 2019 at 11:21 AM Pablo Estrada 
>>> wrote:
>>>
>>>> Thanks Chad, and thank you for notifying on the dev list.
>>>>
>>>> On Wed, Sep 25, 2019 at 10:59 AM Kenneth Knowles 
>>>> wrote:
>>>>
>>>>> Nice.
>>>>>
>>>>> Do we have good pypi caching? If not this could add a lot of overhead
>>>>> to our already-backed-up CI queue. (btw I still think your change is good,
>>>>> and just makes proper caching more important)
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, Sep 24, 2019 at 9:55 PM Chad Dombrova 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I'm working to make the CI experience with python a bit better, and
>>>>>> my current initiative is splitting up the giant Python PreCommit job into 5
>>>>>> separate jobs for Lint, Py2, Py3.5, Py3.6, and Py3.7.
>>>>>>
>>>>>> Around 11am Pacific time tomorrow I'm going to initiate the seed
>>>>>> jobs, at which point all PRs will start to run the new precommit jobs.
>>>>>> It's a bit of a chicken-and-egg scenario with testing this, so there 
>>>>>> could
>>>>>> be issues that pop up after the seed jobs are created, but I'll be 
>>>>>> working
>>>>>> to resolve those issues as quickly as possible.
>>>>>>
>>>>>> If you run into problems because of this change, please let me know
>>>>>> on the github PR.
>>>>>>
>>>>>> Here's the PR: https://github.com/apache/beam/pull/9642
>>>>>> Here's the Jira: https://issues.apache.org/jira/browse/BEAM-8213#
>>>>>>
>>>>>> The upshot is that after this is done you'll get better feedback on
>>>>>> python test failures!
>>>>>>
>>>>>> Let me know if you have any concerns.
>>>>>>
>>>>>> thanks,
>>>>>> chad
>>>>>>
>>>>>>


Re: Why is there no standard boolean coder?

2019-09-29 Thread Chad Dombrova
I’m planning on porting the existing Java coder to Python. Any objections
to that?

-chad


On Sun, Sep 29, 2019 at 1:02 PM Robert Burke  wrote:

> +1
>
> I'm happy to whip together the Go SDK version once the encoding has been
> concretely decided.
>
> On Fri, Sep 27, 2019, 6:07 PM Chad Dombrova  wrote:
>
>>
>> It would still be a standard coder - the distinction I'm proposing is
>>> that there are certain coders that _must_ be implemented by a new
>>> runner/sdk (for example windowedvalue, varint, kv, ...) since they are
>>> important for SDK - runner communication, but now we're starting to
>>> standardize coders that are useful for cross-language and schemas.
>>>
>>
>> Got it.  Sounds good.
>>
>> -chad
>>
>>
>>


Re: Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
> It would still be a standard coder - the distinction I'm proposing is that
> there are certain coders that _must_ be implemented by a new runner/sdk
> (for example windowedvalue, varint, kv, ...) since they are important for
> SDK - runner communication, but now we're starting to standardize coders
> that are useful for cross-language and schemas.
>

Got it.  Sounds good.

-chad


Re: Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
Would BooleanCoder continue to fall into this category?  I was under the
impression we might make it a full fledge standard coder with this PR.



On Fri, Sep 27, 2019 at 5:32 PM Brian Hulette  wrote:

> +1, thank you!
>
> Note In my Row Coder PR I added a new section for "Additional Standard
> Coders" - i.e. coders that have a URN, but aren't required for a new
> runner/sdk to implement the beam model:
> https://github.com/apache/beam/pull/9188/files#diff-f0d64c2cfc4583bfe2a7e5ee59818ae2R646
>
> I think this would belong there as well, assuming that is a distinction we
> want to make.
>
> On Fri, Sep 27, 2019 at 5:22 PM Thomas Weise  wrote:
>
>> +1 for adding the coder
>>
>> Please also add a test here:
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>
>>
>> On Fri, Sep 27, 2019 at 5:17 PM Chad Dombrova  wrote:
>>
>>> Are there any dissenting votes to making a BooleanCoder a standard
>>> (portable) coder?
>>>
>>> I'm happy to make a PR to implement a BooleanCoder in python (and to add
>>> the Java BooleanCoder to the ModelCoderRegistrar) if everyone agrees that
>>> this is useful.
>>>
>>> -chad
>>>
>>>
>>> On Fri, Sep 27, 2019 at 3:32 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> I think boolean is useful to have. What I'm more skeptical of is
>>>> adding standard types for variations like UnsignedInteger16, etc. that
>>>> don't have natural representations in all languages.
>>>>
>>>> On Fri, Sep 27, 2019 at 2:46 PM Brian Hulette 
>>>> wrote:
>>>> >
>>>> > Some more context from an offline discussion I had with +Robert
>>>> Bradshaw a while ago: We both agreed all of the coders listed in BEAM-7996
>>>> should be implemented in Python, but didn't come to a conclusion on whether
>>>> or not they should actually be _standard_ coders, versus just being
>>>> implicitly standard as part of row coder.
>>>> >
>>>> > On Fri, Sep 27, 2019 at 2:29 PM Kenneth Knowles 
>>>> wrote:
>>>> >>
>>>> >> Yes, noted here:
>>>> https://github.com/apache/beam/pull/9188/files#diff-f0d64c2cfc4583bfe2a7e5ee59818ae2R678
>>>> and that links to https://issues.apache.org/jira/browse/BEAM-7996
>>>> >>
>>>> >> Kenn
>>>> >>
>>>> >> On Fri, Sep 27, 2019 at 12:57 PM Reuven Lax 
>>>> wrote:
>>>> >>>
>>>> >>> Java has one, implemented as a byte coder. My guess is that nobody
>>>> has gotten around to implementing it yet for portability.
>>>> >>>
>>>> >>> On Fri, Sep 27, 2019 at 12:44 PM Chad Dombrova 
>>>> wrote:
>>>> >>>>
>>>> >>>> Hi all,
>>>> >>>> It seems a bit unfortunate that there isn’t a portable way to
>>>> serialize a boolean value.
>>>> >>>>
>>>> >>>> I’m working on porting my external PubsubIO PR over to use the
>>>> improved schema-based external transform API in python, but because of this
>>>> limitation I can’t use boolean values. For example, this fails:
>>>> >>>>
>>>> >>>> ReadFromPubsubSchema = typing.NamedTuple(
>>>> >>>> 'ReadFromPubsubSchema',
>>>> >>>> [
>>>> >>>> ('topic', typing.Optional[unicode]),
>>>> >>>> ('subscription', typing.Optional[unicode]),
>>>> >>>> ('id_label',  typing.Optional[unicode]),
>>>> >>>> ('with_attributes', bool),
>>>> >>>> ('timestamp_attribute',  typing.Optional[unicode]),
>>>> >>>> ]
>>>> >>>> )
>>>> >>>>
>>>> >>>> It fails because coders.get_coder(bool) returns the non-portable
>>>> pickle coder.
>>>> >>>>
>>>> >>>> In the short term I can hack something into the external transform
>>>> API to use varint coder for bools, but this kind of hacky approach to
>>>> portability won’t work in scenarios where round-tripping is required
>>>> without user intervention. In other words, in python it is not uncommon to
>>>> test if x is True, in which case the integer 1 would fail this test. All of
>>>> that is to say that a BooleanCoder would be a convenient way to ensure the
>>>> proper type is used everywhere.
>>>> >>>>
>>>> >>>> So, I was just wondering why it’s not there? Are there concerns
>>>> over whether booleans are universal enough to make part of the portability
>>>> standard?
>>>> >>>>
>>>> >>>> -chad
>>>>
>>>


Re: Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
Are there any dissenting votes to making a BooleanCoder a standard
(portable) coder?

I'm happy to make a PR to implement a BooleanCoder in python (and to add
the Java BooleanCoder to the ModelCoderRegistrar) if everyone agrees that
this is useful.

-chad


On Fri, Sep 27, 2019 at 3:32 PM Robert Bradshaw  wrote:

> I think boolean is useful to have. What I'm more skeptical of is
> adding standard types for variations like UnsignedInteger16, etc. that
> don't have natural representations in all languages.
>
> On Fri, Sep 27, 2019 at 2:46 PM Brian Hulette  wrote:
> >
> > Some more context from an offline discussion I had with +Robert Bradshaw
> a while ago: We both agreed all of the coders listed in BEAM-7996 should be
> implemented in Python, but didn't come to a conclusion on whether or not
> they should actually be _standard_ coders, versus just being implicitly
> standard as part of row coder.
> >
> > On Fri, Sep 27, 2019 at 2:29 PM Kenneth Knowles  wrote:
> >>
> >> Yes, noted here:
> https://github.com/apache/beam/pull/9188/files#diff-f0d64c2cfc4583bfe2a7e5ee59818ae2R678
> and that links to https://issues.apache.org/jira/browse/BEAM-7996
> >>
> >> Kenn
> >>
> >> On Fri, Sep 27, 2019 at 12:57 PM Reuven Lax  wrote:
> >>>
> >>> Java has one, implemented as a byte coder. My guess is that nobody has
> gotten around to implementing it yet for portability.
> >>>
> >>> On Fri, Sep 27, 2019 at 12:44 PM Chad Dombrova 
> wrote:
> >>>>
> >>>> Hi all,
> >>>> It seems a bit unfortunate that there isn’t a portable way to
> serialize a boolean value.
> >>>>
> >>>> I’m working on porting my external PubsubIO PR over to use the
> improved schema-based external transform API in python, but because of this
> limitation I can’t use boolean values. For example, this fails:
> >>>>
> >>>> ReadFromPubsubSchema = typing.NamedTuple(
> >>>> 'ReadFromPubsubSchema',
> >>>> [
> >>>> ('topic', typing.Optional[unicode]),
> >>>> ('subscription', typing.Optional[unicode]),
> >>>> ('id_label',  typing.Optional[unicode]),
> >>>> ('with_attributes', bool),
> >>>> ('timestamp_attribute',  typing.Optional[unicode]),
> >>>> ]
> >>>> )
> >>>>
> >>>> It fails because coders.get_coder(bool) returns the non-portable
> pickle coder.
> >>>>
> >>>> In the short term I can hack something into the external transform
> API to use varint coder for bools, but this kind of hacky approach to
> portability won’t work in scenarios where round-tripping is required
> without user intervention. In other words, in python it is not uncommon to
> test if x is True, in which case the integer 1 would fail this test. All of
> that is to say that a BooleanCoder would be a convenient way to ensure the
> proper type is used everywhere.
> >>>>
> >>>> So, I was just wondering why it’s not there? Are there concerns over
> whether booleans are universal enough to make part of the portability
> standard?
> >>>>
> >>>> -chad
>


Re: NOTICE: New Python PreCommit jobs

2019-09-27 Thread Chad Dombrova
By the way, the outcome on this was that splitting the python precommit job
into one job per python version resulted in increasing the total test
completion time by 66%, which is obviously not good.  This is because we
are using Gradle to run the python tests tasks in parallel (the jenkins VMs
have 16 cores each, utilized across 2 slots, IIRC), but after the split
there were only 1-2 gradle tasks per test.  Since the python test runner,
nose, is currently not using parallel execution, there were not enough
concurrent tasks to make proper use of the VM's CPUs.

tl;dr  I'm going to create a followup PR to split out just the Lint job
(same as we have Spotless for Java).   This is our best ROI for now.

-chad


On Fri, Sep 27, 2019 at 3:27 PM Kyle Weaver  wrote:

> > Do we have good pypi caching?
>
> Building Python SDK harness containers takes 2 mins each (times 4, the
> number of versions) on my machine, even if nothing has changed. But we're
> already paying that cost, so I don't think splitting the jobs should make
> it any worse. (https://issues.apache.org/jira/browse/BEAM-8277 if anyone
> has any ideas)
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Wed, Sep 25, 2019 at 11:21 AM Pablo Estrada  wrote:
>
>> Thanks Chad, and thank you for notifying on the dev list.
>>
>> On Wed, Sep 25, 2019 at 10:59 AM Kenneth Knowles  wrote:
>>
>>> Nice.
>>>
>>> Do we have good pypi caching? If not this could add a lot of overhead to
>>> our already-backed-up CI queue. (btw I still think your change is good, and
>>> just makes proper caching more important)
>>>
>>> Kenn
>>>
>>> On Tue, Sep 24, 2019 at 9:55 PM Chad Dombrova  wrote:
>>>
>>>> Hi all,
>>>> I'm working to make the CI experience with python a bit better, and my
>>>> current initiative is splitting up the giant Python PreCommit job into 5
>>>> separate jobs for Lint, Py2, Py3.5, Py3.6, and Py3.7.
>>>>
>>>> Around 11am Pacific time tomorrow I'm going to initiate the seed jobs,
>>>> at which point all PRs will start to run the new precommit jobs.  It's a
>>>> bit of a chicken-and-egg scenario with testing this, so there could be
>>>> issues that pop up after the seed jobs are created, but I'll be working to
>>>> resolve those issues as quickly as possible.
>>>>
>>>> If you run into problems because of this change, please let me know on
>>>> the github PR.
>>>>
>>>> Here's the PR: https://github.com/apache/beam/pull/9642
>>>> Here's the Jira: https://issues.apache.org/jira/browse/BEAM-8213#
>>>>
>>>> The upshot is that after this is done you'll get better feedback on
>>>> python test failures!
>>>>
>>>> Let me know if you have any concerns.
>>>>
>>>> thanks,
>>>> chad
>>>>
>>>>


Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
Hi all,
It seems a bit unfortunate that there isn’t a portable way to serialize a
boolean value.

I’m working on porting my external PubsubIO PR over to use the improved
schema-based external transform API in python, but because of this
limitation I can’t use boolean values. For example, this fails:

ReadFromPubsubSchema = typing.NamedTuple(
'ReadFromPubsubSchema',
[
('topic', typing.Optional[unicode]),
('subscription', typing.Optional[unicode]),
('id_label',  typing.Optional[unicode]),
('with_attributes', bool),
('timestamp_attribute',  typing.Optional[unicode]),
]
)

It fails because coders.get_coder(bool) returns the non-portable pickle
coder.

In the short term I can hack something into the external transform API to
use varint coder for bools, but this kind of hacky approach to portability
won’t work in scenarios where round-tripping is required without user
intervention. In other words, in python it is not uncommon to test if x is
True, in which case the integer 1 would fail this test. All of that is to
say that a BooleanCoder would be a convenient way to ensure the proper type
is used everywhere.
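
Concretely, what I have in mind is something along the lines of the sketch
below -- just an illustration, assuming we mirror the Java coder's
single-byte encoding (the real PR would also register the coder and give it
a URN):

import apache_beam.coders as coders

class BooleanCoder(coders.Coder):
    def encode(self, value):
        # mirror the Java coder: one byte, 1 for True and 0 for False
        return b'\x01' if value else b'\x00'

    def decode(self, encoded):
        return encoded == b'\x01'

    def is_deterministic(self):
        return True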

So, I was just wondering why it’s not there? Are there concerns over
whether booleans are universal enough to make part of the portability
standard?

-chad


NOTICE: New Python PreCommit jobs

2019-09-24 Thread Chad Dombrova
Hi all,
I'm working to make the CI experience with python a bit better, and my
current initiative is splitting up the giant Python PreCommit job into 5
separate jobs for Lint, Py2, Py3.5, Py3.6, and Py3.7.

Around 11am Pacific time tomorrow I'm going to initiate the seed jobs, at
which point all PRs will start to run the new precommit jobs.  It's a bit
of a chicken-and-egg scenario with testing this, so there could be issues
that pop up after the seed jobs are created, but I'll be working to resolve
those issues as quickly as possible.

If you run into problems because of this change, please let me know on the
github PR.

Here's the PR: https://github.com/apache/beam/pull/9642
Here's the Jira: https://issues.apache.org/jira/browse/BEAM-8213#

The upshot is that after this is done you'll get better feedback on python
test failures!

Let me know if you have any concerns.

thanks,
chad


Re: Collecting feedback for Beam usage

2019-09-23 Thread Chad Dombrova
A survey would be a good place to start.  This came up in the
python2-sunsetting thread as well: we don't know what versions of python
people are using with Beam, which makes it difficult to answer the question
of support.

-chad


On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:

> I agree, these are the questions that need to be answered.
> The data can be anonymize and stored as public data in BigQuery or some
> other place.
>
> The intent is to gather usage statistics so that we can learn whether people
> are using Flink, Spark, etc.; it is not intended as a discussion or a
> help channel.
> I also think that we don't need to monitor this actively, as it's more like
> a survey than an active channel for getting issues resolved.
>
> If we think it's useful for the community, then we can come up with a solution
> for how to do this (similar to how we released the container images).
>
>
>
> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver  wrote:
>
>> There are some logistics that would need to be worked out. For example, where
>> would the data go? Who would own it?
>>
>> Also, I'm not convinced we need yet another place to discuss Beam when we
>> already have discussed the challenge of simultaneously monitoring mailing
>> lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly
>> an interesting question, and I'd be curious to know that >= X many people
>> use a certain runner, I'm not sure answers to these questions are as useful
>> for guiding the future of Beam as discussions on the dev/users lists, etc.
>> as the latter likely result in more depth/specific feedback.
>>
>> However, I do think it could be useful in general to include links
>> directly in the console output. For example, maybe something along the
>> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
>> mailing list."
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka  wrote:
>>
>>> Hi,
>>>
>>> At the moment we don't really have a good way to collect any usage
>>> statistics for Apache Beam, like which runner is used, as many of the users
>>> don't really have a way to report their use case.
>>> How about if we create a feedback page where users can add their
>>> pipeline details and usecase.
>>> Also, we can start printing the link to this page when user launch the
>>> pipeline in the command line.
>>> Example:
>>> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>>>
>>> Starting pipeline
>>> Please use
>>> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>>> Pipeline started
>>> ..
>>>
>>> Using a link and not publishing the data automatically will give user
>>> control over what they publish and what they don't. We can enhance the text
>>> and usage further but the basic idea is to ask for user feeback at each run
>>> of the pipeline.
>>> Let me know what you think.
>>>
>>>
>>> Thanks,
>>> Ankur
>>>
>>


Re: When will Beam drop support for python2?

2019-09-19 Thread Chad Dombrova
thanks, email sent!  let me know if it got screwed up. I don't have a Pony
login, so I used the "reply via mail client" button.

-chad


On Thu, Sep 19, 2019 at 1:44 PM Valentyn Tymofieiev 
wrote:

> There was a previous thread on this topic[1]. Should we continue this
> conversation there?
>
> https://lists.apache.org/thread.html/eba6caa58ea79a7ecbc8560d1c680a366b44c531d96ce5c699d41535@%3Cdev.beam.apache.org%3E
>
> On Thu, Sep 19, 2019 at 12:55 PM Chad Dombrova  wrote:
>
>> Hi all,
>> I saw it mentioned on another thread that Beam will drop python2 support
>> by the end of the year, and I'd like to voice my concern over this
>> timeline.  As far as I can tell, Beam's support for python3 is brand new,
>> and based on the master Jira ticket on this topic [1], there are still at
>> least a dozen *known* issues remaining to be resolved.  If we assume it
>> takes another month to resolve all of those, and python2 support is dropped
>> at the end of the year, that leaves a window of barely over 2 months where
>> Beam is fully working for both python versions.  I think that will be an
>> uncomfortably short window for some users to transition their production
>> pipelines to Beam on python3, my company included.  Of course, users can
>> choose to stay on older versions, but with so many important features still
>> under active development (portability, expansion, external IO transforms,
>> schema coders) and new versions of executors tied to the Beam source,
>> staying behind is not really an option for many of us.
>>
>> So I'm hoping we could extend support for python2 for a bit longer, if
>> possible.
>>
>> I'm curious who is using Beam on python3 in production, and for which
>> runners?
>>
>> thanks,
>> -chad
>>
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-1251
>>
>>


Re: Plan for dropping python 2 support

2019-09-19 Thread Chad Dombrova
Hi all,
I had a read through this thread in the archives. It occurred before I
joined the mailing list, so I hope that this email connects up with the
thread properly for everyone.

I'd like to respond to the following points:

I believe we are referring to two separate things with support:
> - Supporting existing releases for patches - I agree that we need to give
> users a long enough window to upgrade. Great if it happens with an LTS
> release. Even if it does not, I think it will be fair to offer patches on
> the last python 2 supporting release during some part of 2020 if that
> becomes necessary.
> - Making new releases with python 2 support - Each new Beam release with
> python 2 support will implicitly extend the lifetime of beam's python 2
> support. I do not think we need to extend this to beyond 2019. 2 releases
> (~ 3 months) after solid python 3 support will very likely put the last
> python 2 supporting release in the last quarter of 2019 already.


With so many important features still under active development
(portability, expansion, external IO transforms, schema coders) and new
versions of executors tied to the Beam source, staying behind is not really
an option for many of us, and with python3 support not yet fully completed,
the window in which Beam is fully working for both python versions is
rapidly approaching 2 months, and could ultimately be even less, depending
on how long it takes to complete the dozen remaining issues in Jira, and
whatever pops up thereafter.

The cost of maintaining Python 2.7 support is higher than 0. Some issues
> that come to mind:
> - Maintaining Py2.7 / Py 3+ compatibility of Beam codebase makes it
> difficult to use Python 3 syntax in Beam which may be necessary to support
> and test syntactic constructs introduced in Python 3.
> - Running additional test suites increases the load on test infrastructure
> and increases flakiness.


I would argue that the cost of maintaining a python2-only LTS version will
be far greater than maintaining python2 support for a little while longer.
Dropping support for python2 could mean a number of things from simply
disabling the python2 tests, to removing 2-to-3 idioms in favor of
python3-only constructs.  If what you have in mind is anything like the
latter then the master branch will become quite divergent from the LTS
release, and backporting changes will be not be as simple as cherry-picking
commits.  All-in-all, I think it's a lose/lose for everyone -- users and
developers, of which I am both -- to drop python2 support on such a short
timeline.

I'm an active contributor to this project and it will put me and the
company that I work for in a very bad position if you force us onto an LTS
release in early 2020.  I understand the appeal of moving to python3-only
code and I want to get there too, but I would hope that you give your users
are much time to transition their own code as the Beam project itself has
taken.  I'm not asking for a full 12 months to transition, but more than a
couple will be required.

thanks,
-chad


When will Beam drop support for python2?

2019-09-19 Thread Chad Dombrova
Hi all,
I saw it mentioned on another thread that Beam will drop python2 support by
the end of the year, and I'd like to voice my concern over this timeline.
As far as I can tell, Beam's support for python3 is brand new, and based on
the master Jira ticket on this topic [1], there are still at least a dozen
*known* issues remaining to be resolved.  If we assume it takes another
month to resolve all of those, and python2 support is dropped at the end of
the year, that leaves a window of barely over 2 months where Beam is fully
working for both python versions.  I think that will be an uncomfortably
short window for some users to transition their production pipelines to
Beam on python3, my company included.  Of course, users can choose to stay
on older versions, but with so many important features still under active
development (portability, expansion, external IO transforms, schema coders)
and new versions of executors tied to the Beam source, staying behind is
not really an option for many of us.

So I'm hoping we could extend support for python2 for a bit longer, if
possible.

I'm curious who is using Beam on python3 in production, and for which
runners?

thanks,
-chad


[1] https://issues.apache.org/jira/browse/BEAM-1251


Re: Using Beam Built-in I/O Transforms with an external framework.

2019-09-18 Thread Chad Dombrova
Hi Pulasthi,
Just to mirror what Cham said, it would be a non-starter to try to use Beam
IO sources in another framework: to make them work, you'd have to build
something that executes them with their expected protocol, and that would
look an awful lot like a Beam runner.  It makes more sense to think about
the problem in reverse:  how do you make a transform (or suite of
transforms) that calls out to your Twister2 framework.  That's something
that's hard to answer without knowing more about the framework, but
splittable DoFns may be a good place to start.  But honestly, if I were you
I'd just invest the time in porting to Beam.  We've recently gone through a
similar process and it's well worth the effort.
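
To make the "think about it in reverse" point concrete, the usual shape is a
composite PTransform that hides the external system behind ordinary Beam
primitives. A rough, hypothetical sketch (there is no such Twister2 client in
Beam; fetch_records here is a made-up stand-in):

import apache_beam as beam

def fetch_records(source_config):
    # hypothetical stand-in for a call into the external framework's reader
    return []

class ReadViaTwister2(beam.PTransform):
    def __init__(self, source_config):
        super(ReadViaTwister2, self).__init__()
        self.source_config = source_config

    def expand(self, pbegin):
        return (
            pbegin
            | 'SeedConfig' >> beam.Create([self.source_config])
            | 'Fetch' >> beam.FlatMap(fetch_records))

Usage would just be records = pipeline | ReadViaTwister2(my_config), which is
exactly the composite-transform pattern Cham points to below.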

-chad








On Wed, Sep 18, 2019 at 2:17 PM Chamikara Jayalath 
wrote:

> Hi Pulasthi,
>
> This might be possible but I don't know if anybody has done this. API of
> Beam sources are no different from other Beam PTransforms and we highly
> recommend hiding away various implementations of source framework related
> abstractions in a composite transform [1]. So what you are looking for is a
> way to use a Beam transform in a separate system.
>
> Thanks,
> Cham
>
> [1]
> https://beam.apache.org/contribute/ptransform-style-guide/#exposing-a-ptransform-vs-something-else
>
> On Wed, Sep 18, 2019 at 1:52 PM Pulasthi Supun Wickramasinghe <
> pulasthi...@gmail.com> wrote:
>
>> Hi Dev's
>>
>> We have a big data processing framework named Twister2, and wanted to
>> know if there is any way we could leverage the I/O Transforms that are
>> built into Apache Beam externally. That is, rather than using them in a Beam
>> pipeline, we would use them as data sources in our project. Just wanted to check
>> with the dev's if such an approach has been done by anyone before, so we
>> could get some pointers on how this could be done. Any pointers in the
>> right direction would be highly appreciated.
>>
>> Best Regards,
>> Pulasthi
>>
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035 <(224)%20386-9035>
>>
>

