Re: Python SDK Arrow Integrations

2019-03-27 Thread Kenneth Knowles
Thinking about Arrow + Beam SQL + schemas:

 - Obviously many SQL operations could be usefully accelerated by arrow /
columnar. Especially in the analytical realm this is the new normal. For
ETL, perhaps less so.

 - Beam SQL planner (pipeline construction) is implemented in Java, and so
the various DoFns/CombineFns that implement projection, filter, etc, are
also in Java.
- Arrow is of course available in Java.
- Presumably there is a pandas counterpart in Java. Is there? Do you
know?
 - But perhaps if these building blocks emitted by the planner had
well-defined URNs, we could opportunistically use SIMD+columnar Java or Python
implementations to avoid cross-language data channels. (thinking ahead to
when cross-language allows Python pipelines to invoke the construction-time
planner implemented in Java)

 - Is it valuable for Beam to invent its own schemas? I'd love for Beam to
have identical schema affordances to either protobuf or arrow or avro, with
everything layered on that as logical types (including SQL types). What
would it look like if Beam schemas were more-or-less Arrow schemas?

 - For the event timestamp issue, there are two levels of abstraction at
which I could imagine improvements:
- at the model layer (aka portability protos) we could make Beam
columnar batch aware. That's a huge change and would need a massive
justification IMO in the form of performance numbers.
- at the SDK layer, some language might make it pretty easy to overload
the "GroupByKey" transform to understand that elements that are really
batches contain multiple timestamps, so it may need to window & group
differently (a rough sketch follows below). The model doesn't need to know
in this case.
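
For illustration only, a rough Python sketch of that SDK-layer idea, assuming
elements are pyarrow record batches and that a per-row event-time column holds
Unix timestamps in seconds (the transform and column names are hypothetical):

    import apache_beam as beam
    from apache_beam.transforms import window

    class ExplodeBatchWithTimestamps(beam.DoFn):
        """Expands a pyarrow.RecordBatch into per-row elements, each carrying
        its own event timestamp, so ordinary windowing and GroupByKey apply."""

        def __init__(self, timestamp_column):
            self._timestamp_column = timestamp_column

        def process(self, batch):
            columns = batch.to_pydict()  # column name -> list of values
            timestamps = columns[self._timestamp_column]
            for i in range(batch.num_rows):
                row = {name: values[i] for name, values in columns.items()}
                yield window.TimestampedValue(row, timestamps[i])

Rows could then be keyed, windowed, and grouped as usual; the model itself
stays unchanged.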

Kenn

On Wed, Mar 27, 2019 at 4:42 PM Ahmet Altay  wrote:

> Thank you Brian, this looks promising.
>
> cc: +Chamikara Jayalath  +Heejong Lee
> 
>
> On Wed, Mar 27, 2019 at 1:22 PM Brian Hulette  wrote:
>
>> Hi everyone,
>> I've been doing some investigations into how Arrow might fit into Beam as
>> a way to ramp up on the project. As I've gone about this I've prototyped a
>> couple of additions to the Python SDK. I think these additions may be
>> useful for others so I'm considering cleaning them up and submitting PRs,
>> but I wanted to have a discussion here to see if it makes sense first.
>>
>> Note that the approach I'm using for this work right now is very naive.
>> I've just built pipelines where individual elements are actually arrow
>> record batches (or pandas DataFrames). This is really only acceptable for
>> bounded pipelines without windowing, since it's impossible to define a
>> single event time for each element. That being said, I think these tools
>> could still be useful for people who want to run batch pipelines using
>> parquet, arrow, and pandas.
>>
>
> I agree these will be generally useful.
>
>
>>
>> Here's what I've implemented so far:
>> # An option for io.ReadFromParquet to yield arrow record batches instead
>> of individual elements
>> Currently the python SDK's parquet reader uses pyarrow.parquet to read
>> parquet row groups into arrow record batches, and then splits the batches
>> into a single dictionary per row [1]. I've added a flag to optionally
>> short-circuit this and just yield the arrow record batches directly, making
>> it easier for me to build pipelines that process columnar batches. If I
>> were to contribute this change I could also split out the record batch <->
>> dictionary conversions as separate transforms, since they could be
>> generally useful as well.
>>
>
> I think splitting into new transforms rather than adding new options to
> existing IO transforms would be simpler for users. I think this would be a
> question that could be easier to answer with a PR.
>
>
>> # Custom coders for Arrow Tables and Record Batches
>> I found that the default coder (pickle/dill?) slowed down my arrow
>> pipelines, particularly in the case where a record batch had been sliced
>> into smaller record batches (presumably because the entire original batch
>> is getting serialized for each slice). I put together some coders that
>> encode arrow tables and record batches with arrow's IPC formats, which
>> improves performance substantially.
>>
>
> How did you measure this? It would be good for us to also have relevant
> micro benchmarks here.
>
>
>>
>> Would it make sense to add these things to the Python SDK? Or would they
>> fit better in a separate library of utilities for building pipelines with
>> batched data?
>>
>
> +1 for adding to Beam. Python SDK has a list of utility transforms
> (e.g. BatchElements), new additions could also live in that space.
>
>
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py#L239
>>
>


Re: NullPointerException - Session windows with Lateness in FlinkRunner

2019-03-27 Thread rahul patwari
+dev

On Wed 27 Mar, 2019, 9:47 PM rahul patwari, 
wrote:

> Hi,
> I am using Beam 2.11.0, Runner - beam-runners-flink-1.7, Flink Cluster -
> 1.7.2.
>
> I have this flow in my pipeline:
> KafkaSource(withCreateTime())  -->  ApplyWindow(SessionWindow with
> gapDuration=1 Minute, lateness=3 Minutes, AccumulatingFiredPanes, default
> trigger)  -->  BeamSQL(GroupBy query)  -->  Window.remerge()  -->
> Enrichment  -->  KafkaSink
>
> I am generating data in such a way that the first two records belong to
> two different sessions. The third record is generated before the first
> session expires, with its timestamp chosen in such a way that the two
> sessions will be merged into a single session.
>
> For example, these are the sample input and output obtained when I ran the
> same pipeline in DirectRunner.
>
> Sample Input:
> {"Col1":{"string":"str1"}, "Timestamp":{"string":"2019-03-27 15-27-44"}}
> {"Col1":{"string":"str1"}, "Timestamp":{"string":"2019-03-27 15-28-51"}}
> {"Col1":{"string":"str1"}, "Timestamp":{"string":"2019-03-27 15-28-26"}}
>
> Sample Output:
> {"Col1":{"string":"str1"},"NumberOfRecords":{"long":1},"WST":{"string":"2019-03-27
> 15-27-44"},"WET":{"string":"2019-03-27 15-28-44"}}
> {"Col1":{"string":"str1"},"NumberOfRecords":{"long":1},"WST":{"string":"2019-03-27
> 15-28-51"},"WET":{"string":"2019-03-27 15-29-51"}}
> {"Col1":{"string":"str1"},"NumberOfRecords":{"long":3},"WST":{"string":"2019-03-27
> 15-27-44"},"WET":{"string":"2019-03-27 15-29-51"}}
>
> Where "NumberOfRecords" is the count, "WST" is the Avro field Name which
> indicates the window start time for the session window. Similarly "WET"
> indicates the window End time of the session window. I am getting "WST" and
> "WET" after remerging and applying ParDo(Enrichment stage of the pipeline).
>
> The program ran successfully in DirectRunner. But, in FlinkRunner, I am
> getting this exception when the third record arrives:
>
> 2019-03-27 15:31:00,442 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
> DfleKafkaSource/KafkaIO.Read/Read(KafkaUnboundedSource) -> Flat Map ->
> DfleKafkaSource/ParDo(ConvertKafkaRecordtoRow)/ParMultiDo(ConvertKafkaRecordtoRow)
> -> (Window.Into()/Window.Assign.out ->
> DfleSql/SqlTransform/BeamIOSourceRel_4/Convert.ConvertTransform/ParDo(Anonymous)/ParMultiDo(Anonymous)
> ->
> DfleSql/SqlTransform/BeamAggregationRel_45/Group.CombineFieldsByFields/Group.ByFields/Group
> by fields/ParMultiDo(Anonymous) -> ToKeyedWorkItem,
> DfleKafkaSink/ParDo(RowToGenericRecordConverter)/ParMultiDo(RowToGenericRecordConverter)
> -> DfleKafkaSink/KafkaIO.KafkaValueWrite/Kafka values with default
> key/Map/ParMultiDo(Anonymous) ->
> DfleKafkaSink/KafkaIO.KafkaValueWrite/KafkaIO.Write/Kafka
> ProducerRecord/Map/ParMultiDo(Anonymous) ->
> DfleKafkaSink/KafkaIO.KafkaValueWrite/KafkaIO.Write/KafkaIO.WriteRecords/ParDo(KafkaWriter)/ParMultiDo(KafkaWriter))
> (1/1) (d00be62e110cef00d9d772042f4b87a9) switched from DEPLOYING to RUNNING.
> 2019-03-27 15:33:25,427 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph-
> DfleSql/SqlTransform/BeamAggregationRel_45/Group.CombineFieldsByFields/Group.ByFields/GroupByKey
> ->
> DfleSql/SqlTransform/BeamAggregationRel_45/Group.CombineFieldsByFields/Combine.GroupedValues/ParDo(Anonymous)/ParMultiDo(Anonymous)
> ->
> DfleSql/SqlTransform/BeamAggregationRel_45/mergeRecord/ParMultiDo(Anonymous)
> -> DfleSql/SqlTransform/BeamCalcRel_46/ParDo(Calc)/ParMultiDo(Calc) ->
> DfleSql/Window.Remerge/Identity/Map/ParMultiDo(Anonymous) ->
> DfleSql/ParDo(EnrichRecordWithWindowTimeInfo)/ParMultiDo(EnrichRecordWithWindowTimeInfo)
> ->
> DfleKafkaSink2/ParDo(RowToGenericRecordConverter)/ParMultiDo(RowToGenericRecordConverter)
> -> DfleKafkaSink2/KafkaIO.KafkaValueWrite/Kafka values with default
> key/Map/ParMultiDo(Anonymous) ->
> DfleKafkaSink2/KafkaIO.KafkaValueWrite/KafkaIO.Write/Kafka
> ProducerRecord/Map/ParMultiDo(Anonymous) ->
> DfleKafkaSink2/KafkaIO.KafkaValueWrite/KafkaIO.Write/KafkaIO.WriteRecords/ParDo(KafkaWriter)/ParMultiDo(KafkaWriter)
> (1/1) (d95af17b7457443c13bd327b46b282e6) switched from RUNNING to FAILED.
> org.apache.beam.sdk.util.UserCodeException: java.lang.NullPointerException
> at
> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
> at
> org.apache.beam.runners.core.GroupAlsoByWindowViaWindowSetNewDoFn$DoFnInvoker.invokeProcessElement(Unknown
> Source)
> at
> org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
> at
> org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
> at
> org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
> at
> org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:63)
> at
> 

Re: Python SDK Arrow Integrations

2019-03-27 Thread Ahmet Altay
Thank you Brian, this looks promising.

cc: +Chamikara Jayalath  +Heejong Lee


On Wed, Mar 27, 2019 at 1:22 PM Brian Hulette  wrote:

> Hi everyone,
> I've been doing some investigations into how Arrow might fit into Beam as
> a way to ramp up on the project. As I've gone about this I've prototyped a
> couple of additions to the Python SDK. I think these additions may be
> useful for others so I'm considering cleaning them up and submitting PRs,
> but I wanted to have a discussion here to see if it makes sense first.
>
> Note that the approach I'm using for this work right now is very naive.
> I've just built pipelines where individual elements are actually arrow
> record batches (or pandas DataFrames). This is really only acceptable for
> bounded pipelines without windowing, since it's impossible to define a
> single event time for each element. That being said, I think these tools
> could still be useful for people who want to run batch pipelines using
> parquet, arrow, and pandas.
>

I agree these will be generally useful.


>
> Here's what I've implemented so far:
> # An option for io.ReadFromParquet to yield arrow record batches instead
> of individual elements
> Currently the python SDK's parquet reader uses pyarrow.parquet to read
> parquet row groups into arrow record batches, and then splits the batches
> into a single dictionary per row [1]. I've added a flag to optionally
> short-circuit this and just yield the arrow record batches directly, making
> it easier for me to build pipelines that process columnar batches. If I
> were to contribute this change I could also split out the record batch <->
> dictionary conversions as separate transforms, since they could be
> generally useful as well.
>

I think splitting into new transforms rather than adding new options to
existing IO transforms would be simpler for users. I think this would be a
question that could be easier to answer with a PR.


> # Custom coders for Arrow Tables and Record Batches
> I found that the default coder (pickle/dill?) slowed down my arrow
> pipelines, particularly in the case where a record batch had been sliced
> into smaller record batches (presumably because the entire original batch
> is getting serialized for each slice). I put together some coders that
> encode arrow tables and record batches with arrow's IPC formats, which
> improves performance substantially.
>

How did you measure this? It would be good for us to also have relevant
micro benchmarks here.


>
> Would it make sense to add these things to the Python SDK? Or would they
> fit better in a separate library of utilities for building pipelines with
> batched data?
>

+1 for adding to Beam. Python SDK has a list of utility transforms
(e.g. BatchElements), new additions could also live in that space.
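
For reference, a minimal sketch of how that existing utility is used (the
pipeline itself is illustrative only):

    import apache_beam as beam
    from apache_beam.transforms.util import BatchElements

    with beam.Pipeline() as p:
        _ = (p
             | beam.Create(range(100))
             # Group individual elements into lists of 10-50 elements,
             # roughly the inverse of exploding a record batch into rows.
             | BatchElements(min_batch_size=10, max_batch_size=50)
             | beam.Map(print))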


>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py#L239
>


Re: [PROPOSAL] Standardize Gradle structure in Python SDK

2019-03-27 Thread Ahmet Altay
This sounds good to me. Thank you for doing this. A few questions:
- Could you comment on what kind of parallelization we will gain by this?
In terms of real numbers, how would this affect build and test times?
- I am guessing this will reduce complexity. Is it possible to quantify the
improvement related to this?
- Beyond the proposal, I am assuming you are willing to work on this; just
want to clarify. In either case, would you need help?

Thank you,
Ahmet

On Wed, Mar 27, 2019 at 10:19 AM Mark Liu  wrote:

> Hi Python SDK Developers,
>
> You may notice that Gradle files changed a lot recently as parallelization
> was applied to Python tests and more Python versions were enabled in testing.
> There are tricks scattered over the build scripts, and tests have grown
> organically and are distributed under sdks/python, which has caused friction
> (like the rollback of PR-8059
> ).
>
> Thus, I created BEAM-6907
>  and would like to
> initiate some work to clean up and standardize the Gradle structure in the
> Python SDK. In general, I think we want to:
>
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks
>
> This is the Gradle directory structure I propose:
>
> sdks/python/
>     build.gradle        --> holds build, snapshot, and analytic tasks
>     test-suites/        --> all pre/post/VR test suites under here
>         README.md
>         dataflow/       --> grouped by runner or unit test (tox)
>             py27/       --> grouped by py version
>                 build.gradle
>             py35/
>             ...
>         direct/
>             py27/
>             ...
>         flink/
>         tox/
>         ...
>
>
> The ideas are:
> - Only keep builds, snapshot and analytic jobs in sdks/python/build.gradle
> - Move all test related tasks to sdks/python/test-suites/
> - In sdks/python/test-suites, we first group by runner, unit test (tox), or
> other testing that doesn't fit those categories, and then by Python version
> if needed.
> - An example of ../test-suites/../py35/build.gradle is this
> 
> .
>
> Please feel free to explore the existing Gradle scripts in the Python SDK
> and bring up any thoughts on this proposal if you have them.
>
> Thanks!
> Mark
>


Re: Frequent failures on beam8

2019-03-27 Thread Mikhail Gryzykhin
And another one.
beam14 OOMs

On Mon, Mar 25, 2019 at 5:54 PM Yifan Zou  wrote:

> beam8 is disabled now.
>
> On Mon, Mar 25, 2019 at 2:06 PM Mikhail Gryzykhin 
> wrote:
>
>> Yifan is looking into this.
>>
>> On Mon, Mar 25, 2019 at 1:55 PM Boyuan Zhang  wrote:
>>
>>> Hey all,
>>>
>>> Could anyone help take a look at beam8
>>> ? Seems like there are
>>> many tests that failed on beam8 owing to infra problems.
>>>
>>> Thanks!
>>>
>>


Re: Deprecating Avro for fastavro on Python 3

2019-03-27 Thread Chamikara Jayalath
+1 for making use_fastavro the default for Python3. I don't see any
significant drawbacks in doing this from Beam's point of view. One concern
is whether avro and fastavro can safely co-exist in the same environment so
that Beam continues to work for users who already have the avro library
installed.

Note that there are two use_fastavro flags (confusingly enough).
(1) for avro file source [1]
(2) an experiment flag [2] with the same name that makes the Dataflow runner
use the fastavro library for reading/writing intermediate files and for reading
Avro files exported by BigQuery.

I can help with the latter.
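
For concreteness, a small sketch of how the experiment flag (2) is passed
(option values are illustrative; flag (1) is the use_fastavro parameter on the
avro source itself):

    from apache_beam.options.pipeline_options import PipelineOptions

    # The experiment consumed by the Dataflow runner; the file-source flag is
    # the separate use_fastavro parameter on the avro source.
    options = PipelineOptions(['--experiments=use_fastavro'])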

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/avroio.py#L81
[2]
https://lists.apache.org/thread.html/94bd362a3a041654e6ef9003fb3fa797e25274fdb4766065481a0796@%3Cuser.beam.apache.org%3E

Thanks,
Cham

On Wed, Mar 27, 2019 at 3:27 PM Valentyn Tymofieiev 
wrote:

> Thanks, Robbe and Frederik, for raising this.
>
> Over the course of making Beam Python 3 compatible, this is at least the
> second time [1] we have had to deal with an error in the avro-python3 package.
> The release cadence of Apache Avro (1 release a year)
> is concerning to me [2]. Even if we have a new release with Python 3 fixes
> soon, as Beam users start using Beam more actively on Python 3, we may
> encounter more issues in avro-python3. If this happens, Beam will have to
> monkey-patch its way around the avro-python3 issues, because waiting for the
> next Avro release may not be practical.
>
> So, I agree that it is a good time to start transitioning off of the
> avro/avro-python3 dependency, given that fastavro is known to be a faster
> alternative [3] and is released monthly [4].
>
> There are a couple of ways to make this transition, depending on how careful
> we want to be. We should:
>
> 1. Remove the dependency on avro in the current codepath whenever fastavro
> is used, as you propose.
> 2. Remove the Beam dependency on avro-python3 now, OR, if we want to be
> safer, set use_fastavro=True as the default option on Python 3, but keep the
> dependency on avro-python3 and keep that codepath, even though it may not
> work right now on Py3 but might work after the next Avro release.
> 3. Set use_fastavro=True as the default option on Python 2.
> 4. Remove the Beam dependency on avro and avro-python3 after several releases.
>
> Adding +Chamikara Jayalath and +Udi Meiri, who have been working on Beam
> IOs and may have some thoughts here. Do you think it is safe to make
> use_fastavro=True the default option for both Py2 and Py3 now? If we make
> use_fastavro the default option on Py3, do you think there is a benefit to
> keeping the Avro codepath on Py3, or can we remove it?
>
> Thanks,
> Valentyn
>
> [1] https://github.com/apache/avro/pull/436
> [2] https://avro.apache.org/releases.html
> [3]
> https://medium.com/@abrarsheikh/benchmarking-avro-and-fastavro-using-pytest-benchmark-tox-and-matplotlib-bd7a83964453
> [4] https://pypi.org/project/fastavro/#history
>
> On Wed, Mar 27, 2019 at 10:49 AM Robbe Sneyders 
> wrote:
>
>> Hi all,
>>
>> We're looking at fixing avroio on Python 3, which still fails due to a
>> non-picklable schema class in Avro [1]. This is fixed when using the latest
>> Avro master, but the last release dates back to May 2017.
>>
>> Fastavro does not have the same problem, but is currently also failing
>> due to a dependency of avroio on Avro for schema parsing.
>>
>> We would therefore propose to (temporarily?) deprecate Avro on Python 3,
>> and implement a pure fastavro solution instead. +Frederik Bode
>>   already submitted a PR for this [2].
>>
>> Use of fastavro is currently activated with the `use_fastavro` flag,
>> which defaults to False. Since this flag would not make sense anymore on
>> Python 3, we would like to switch the default value to True. The
>> documentation already mentions that this will probably become the default
>> in the long term, but this change would also impact Python 2. Is this a
>> problem?
>>
>> Also, looking at the performance gain of fastavro, is there any reason to
>> not deprecate Avro in favor of fastavro on Python 3 indefinitely?
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6522#comment-16784499
>> [2] https://github.com/apache/beam/pull/8130
>>
>> Kind regards,
>> Robbe
>>
>


Re: Deprecating Avro for fastavro on Python 3

2019-03-27 Thread Valentyn Tymofieiev
Thanks, Robbe and Frederik, for raising this.

Over the course of making Beam Python 3 compatible, this is at least the
second time [1] we have had to deal with an error in the avro-python3 package.
The release cadence of Apache Avro (1 release a year)
is concerning to me [2]. Even if we have a new release with Python 3 fixes
soon, as Beam users start using Beam more actively on Python 3, we may
encounter more issues in avro-python3. If this happens, Beam will have to
monkey-patch its way around the avro-python3 issues, because waiting for the
next Avro release may not be practical.

So, I agree that it is a good time to start transitioning off of the
avro/avro-python3 dependency, given that fastavro is known to be a faster
alternative [3] and is released monthly [4].

There are a couple of ways to make this transition, depending on how careful
we want to be. We should:

1. Remove the dependency on avro in the current codepath whenever fastavro
is used, as you propose.
2. Remove the Beam dependency on avro-python3 now, OR, if we want to be
safer, set use_fastavro=True as the default option on Python 3, but keep the
dependency on avro-python3 and keep that codepath, even though it may not
work right now on Py3 but might work after the next Avro release.
3. Set use_fastavro=True as the default option on Python 2.
4. Remove the Beam dependency on avro and avro-python3 after several releases.

Adding +Chamikara Jayalath and +Udi Meiri, who have been working on Beam IOs
and may have some thoughts here. Do you think it is safe to make
use_fastavro=True the default option for both Py2 and Py3 now? If we make
use_fastavro the default option on Py3, do you think there is a benefit to
keeping the Avro codepath on Py3, or can we remove it?

Thanks,
Valentyn

[1] https://github.com/apache/avro/pull/436
[2] https://avro.apache.org/releases.html
[3]
https://medium.com/@abrarsheikh/benchmarking-avro-and-fastavro-using-pytest-benchmark-tox-and-matplotlib-bd7a83964453
[4] https://pypi.org/project/fastavro/#history

On Wed, Mar 27, 2019 at 10:49 AM Robbe Sneyders 
wrote:

> Hi all,
>
> We're looking at fixing avroio on Python 3, which still fails due to a
> non-picklable schema class in Avro [1]. This is fixed when using the latest
> Avro master, but the last release dates back to May 2017.
>
> Fastavro does not have the same problem, but is currently also failing due
> to a dependency of avroio on Avro for schema parsing.
>
> We would therefore propose to (temporarily?) deprecate Avro on Python 3,
> and implement a pure fastavro solution instead. +Frederik Bode
>   already submitted a PR for this [2].
>
> Use of fastavro is currently activated with the `use_fastavro` flag, which
> defaults to False. Since this flag would not make sense anymore on Python
> 3, we would like to switch the default value to True. The documentation
> already mentions that this will probably become the default in the long
> term, but this change would also impact Python 2. Is this a problem?
>
> Also, looking at the performance gain of fastavro, is there any reason to
> not deprecate Avro in favor of fastavro on Python 3 indefinitely?
>
> [1] https://issues.apache.org/jira/browse/BEAM-6522#comment-16784499
> [2] https://github.com/apache/beam/pull/8130
>
> Kind regards,
> Robbe
>


Re: New contributor

2019-03-27 Thread Kenneth Knowles
Welcome!

On Wed, Mar 27, 2019 at 2:59 PM Mikhail Gryzykhin  wrote:

> Welcome Niklas.
>
> This is another location with useful resources for contributors:
> https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides (contributor
> guide has a link to this as well, though)
>
> On Wed, Mar 27, 2019 at 10:54 AM Connell O'Callaghan 
> wrote:
>
>> Welcome Niklas - given your background it will be very interesting to
>> see your contributions.
>>
>> On Wed, Mar 27, 2019 at 10:29 AM Mark Liu  wrote:
>>
>>> Welcome!
>>>
>>> Mark
>>>
>>> On Wed, Mar 27, 2019 at 10:09 AM Lukasz Cwik  wrote:
>>>
 Welcome. The getting started[1] and contribution guides[2] are most
 useful. I have also added you as a contributor to the JIRA project.

 1: https://beam.apache.org/get-started/beam-overview/
 2: https://beam.apache.org/contribute/

 On Wed, Mar 27, 2019 at 9:38 AM Niklas Hansson <
 niklas.sven.hans...@gmail.com> wrote:

> Hi!
>
> I work as a data scientist within banking but will switch over to
> manufacturing the next month. I would like to contribute to Beam and
> especially the Python SDK. Could you add me as a contributor?
>
> I am new to open source contribution so feel free to give me any
> advice or point me in the right direction. Plan to start off with some of
> the starter tasks from the Jira board.
>
> Best regards
> Niklas
>



Re: New contributor

2019-03-27 Thread Mikhail Gryzykhin
Welcome Niklas.

This is another location with useful resources for contributors:
https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides (contributor
guide has a link to this as well, though)

On Wed, Mar 27, 2019 at 10:54 AM Connell O'Callaghan 
wrote:

> Welcome Niklas - given your background it will be very interesting to see
> your contributions.
>
> On Wed, Mar 27, 2019 at 10:29 AM Mark Liu  wrote:
>
>> Welcome!
>>
>> Mark
>>
>> On Wed, Mar 27, 2019 at 10:09 AM Lukasz Cwik  wrote:
>>
>>> Welcome. The getting started[1] and contribution guides[2] are most
>>> useful. I have also added you as a contributor to the JIRA project.
>>>
>>> 1: https://beam.apache.org/get-started/beam-overview/
>>> 2: https://beam.apache.org/contribute/
>>>
>>> On Wed, Mar 27, 2019 at 9:38 AM Niklas Hansson <
>>> niklas.sven.hans...@gmail.com> wrote:
>>>
 Hi!

 I work as a data scientist within banking but will switch over to
 manufacturing the next month. I would like to contribute to Beam and
 especially the Python SDK. Could you add me as a contributor?

 I am new to open source contribution so feel free to give me any advice
 or point me in the right direction. Plan to start off with some of the
 starter tasks from the Jira board.

 Best regards
 Niklas

>>>


Debugging :beam-sdks-java-io-hadoop-input-format:test

2019-03-27 Thread Mikhail Gryzykhin
Hi everyone,

I have a pre-commit job that fails on
*:beam-sdks-java-io-hadoop-input-format:test*.
Relevant PR. 

The target doesn't have any explicit log associated with it. Running the same
target locally doesn't give me much help. It seems to fail somewhere in the
native runtime.

Can someone help with tackling this issue?

Regards,
Mikhail.


Python SDK Arrow Integrations

2019-03-27 Thread Brian Hulette
Hi everyone,
I've been doing some investigations into how Arrow might fit into Beam as a
way to ramp up on the project. As I've gone about this I've prototyped a
couple of additions to the Python SDK. I think these additions may be
useful for others so I'm considering cleaning them up and submitting PRs,
but I wanted to have a discussion here to see if it makes sense first.

Note that the approach I'm using for this work right now is very naive.
I've just built pipelines where individual elements are actually arrow
record batches (or pandas DataFrames). This is really only acceptable for
bounded pipelines without windowing, since it's impossible to define a
single event time for each element. That being said, I think these tools
could still be useful for people who want to run batch pipelines using
parquet, arrow, and pandas.

Here's what I've implemented so far:
# An option for io.ReadFromParquet to yield arrow record batches instead of
individual elements
Currently the python SDK's parquet reader uses pyarrow.parquet to read
parquet row groups into arrow record batches, and then splits the batches
into a single dictionary per row [1]. I've added a flag to optionally
short-circuit this and just yield the arrow record batches directly, making
it easier for me to build pipelines that process columnar batches. If I
were to contribute this change I could also split out the record batch <->
dictionary conversions as separate transforms, since they could be
generally useful as well.
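
For illustration, a rough sketch of what the record batch -> dictionary
direction could look like as a standalone transform (the name is hypothetical,
not the actual prototype):

    import apache_beam as beam

    class RecordBatchToRows(beam.PTransform):
        """Expands each pyarrow.RecordBatch into one dict per row."""

        def expand(self, pcoll):
            return pcoll | beam.FlatMap(self._to_rows)

        @staticmethod
        def _to_rows(batch):
            columns = batch.to_pydict()  # column name -> list of values
            for i in range(batch.num_rows):
                yield {name: values[i] for name, values in columns.items()}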

# Custom coders for Arrow Tables and Record Batches
I found that the default coder (pickle/dill?) slowed down my arrow
pipelines, particularly in the case where a record batch had been sliced
into smaller record batches (presumably because the entire original batch
is getting serialized for each slice). I put together some coders that
encode arrow tables and record batches with arrow's IPC formats, which
improves performance substantially.
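
A rough sketch of what such a coder could look like using pyarrow's stream IPC
APIs (the actual prototype coders may differ):

    import io

    import apache_beam as beam
    import pyarrow as pa

    class RecordBatchCoder(beam.coders.Coder):
        """Encodes a single pyarrow.RecordBatch with the Arrow stream IPC
        format instead of generic pickling."""

        def encode(self, batch):
            sink = io.BytesIO()
            writer = pa.RecordBatchStreamWriter(sink, batch.schema)
            writer.write_batch(batch)
            writer.close()
            return sink.getvalue()

        def decode(self, encoded):
            reader = pa.RecordBatchStreamReader(io.BytesIO(encoded))
            return reader.read_next_batch()

        def is_deterministic(self):
            return False

    # Registering it so RecordBatch-typed PCollections pick it up:
    # beam.coders.registry.register_coder(pa.RecordBatch, RecordBatchCoder)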


Would it make sense to add these things to the Python SDK? Or would they
fit better in a separate library of utilities for building pipelines with
batched data?

Brian

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py#L239


Re: Build blocking on

2019-03-27 Thread Robert Burke
Again, very valid concerns! I wouldn't take that step lightly (e.g. I'd test
every single Go-using Gradle task we have, if not that explicit case).

A happier path would be that gogradle "just works" with go modules, and we
can avoid the whole awkward double lockfile state, and version mismatches
between what GoGradle is using and testing on Jenkins vs. the pipeline
author's experience.

I'm not sure, and I won't be convinced until I have time to try it. I'm
just skeptical they do, since the issue to support using them has
received no traction in the half year it's been filed.

Given that GoGradle by its nature is manually invoking the Go tool, it's
entirely possible the right step is to avoid the Gradle lock file and the
built-in vendoring commands and just switch to Go Modules, since our usage
is very simple and the Go tool will continue to do its own state
management. Again, it needs testing and that takes time.


On Tue, 26 Mar 2019 at 15:18, Michael Luckey  wrote:

> Of course, we could implement something here. But I am worried about the
> consequences. As gogradle writes into (user) global state this would have
> unexpected side effects.
>
> Consider a developer running Project A - which happens to also use
> gogradle - who, during a build of that project, issues an innocent beam clean.
> It would be no fun to track down all the failures arising out of that
> now-deleted state.
>
> Unfortunately, I do not yet have a clear understanding of what's going on
> here, why that cache is put there, and how it gets inconsistent. I need to
> look into that. (But this might be obsoleted anyway by Robert's plans to drop
> gogradle for something different.)
>
> On Tue, Mar 26, 2019 at 7:28 PM Thomas Weise  wrote:
>
>> Can this be addressed by having "clean" remove all state that gogradle
>> leaves behind? This staleness issue has bitten me a few times also and it
>> would be good to have a reliable way to deal with it, even if it involves
>> an extra clean.
>>
>>
>> On Tue, Mar 26, 2019 at 11:14 AM Michael Luckey 
>> wrote:
>>
>>> @Udi
>>> Did you try to just delete the
>>> '/usr/local/google/home/ehudm/.gradle/go/repo/cloud.google.com' folder?
>>>
>>> @Robert
>>> As said before, I am a bit scared about the implications. Shelling out
>>> is done by python, and from a build perspective, this does not work very
>>> well, unfortunately. I.e. no caching, up-to-date checks, etc.
>>>
>>> But of course, we need to play with this a bit more.
>>>
>>> On Tue, Mar 26, 2019 at 6:24 PM Robert Burke  wrote:
>>>
 Reading the error from the gradle scan, it largely looks like some part
 of the GCP dependencies for the build depends on a package whose
 commit version is no longer around. The main issue with gogradle is that
 it's entirely distinct from the usual Go workflow, which means the deps users
 use are likely to differ from what's in the lock file.

 This work will be tracked in
 https://issues.apache.org/jira/browse/BEAM-5379
 GoGradle hasn't moved to support the new Go way of handling deps, so my
 inclination is to simplify to plain Gradle scripts that shell out to the
 Go tool for handling Go dep management, rather than trying to fix GoGradle.

 On Tue, 26 Mar 2019 at 09:43, Udi Meiri  wrote:

> Robert, from what I recall it's not flaky for me - it consistently
> fails. Let me know if there's a way to get more logging about this error.
>
> On Mon, Mar 25, 2019, 19:50 Robert Burke  wrote:
>
>> It's concerning to me that 1) the Go dependency resolution via
>> gogradle is flaky, and 2) that it can block other languages.
>>
>> I suppose 2) makes sense since it's part of the container
>> bootstrapping code, but that makes 1) a serious problem, of which I 
>> wasn't
>> aware.
>> I should have time to investigate this in the next two weeks.
>>
>> On Mon, 25 Mar 2019 at 18:08, Michael Luckey 
>> wrote:
>>
>>> Just for the record,
>>>
>>> using a VM here, because I did not yet get all tasks running on my Mac,
>>> and did not want to mess with my setup.
>>>
>>> So I installed vanilla Ubuntu 18.04 LTS on VirtualBox (26GB RAM, 6
>>> cores) and then ran:
>>>
>>> sudo apt update
>>>
>>> sudo apt install gcc
>>>
>>> sudo apt install make
>>>
>>> sudo apt install perl
>>>
>>> sudo apt install curl
>>>
>>> sudo apt install openjdk-8-jdk
>>>
>>> sudo apt install python
>>>
>>> sudo apt install -y software-properties-common
>>>
>>> sudo add-apt-repository ppa:deadsnakes/ppa
>>>
>>> sudo apt update
>>>
>>> sudo apt install python3.5
>>>
>>> sudo apt-get install apt-transport-https ca-certificates curl
>>> gnupg-agent software-properties-common
>>>
>>> curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo
>>> apt-key 

Re: New contributor

2019-03-27 Thread Connell O'Callaghan
Welcome Niklas - given your background it will be very interesting to see
your contributions.

On Wed, Mar 27, 2019 at 10:29 AM Mark Liu  wrote:

> Welcome!
>
> Mark
>
> On Wed, Mar 27, 2019 at 10:09 AM Lukasz Cwik  wrote:
>
>> Welcome. The getting started[1] and contribution guides[2] are most
>> useful. I have also added you as a contributor to the JIRA project.
>>
>> 1: https://beam.apache.org/get-started/beam-overview/
>> 2: https://beam.apache.org/contribute/
>>
>> On Wed, Mar 27, 2019 at 9:38 AM Niklas Hansson <
>> niklas.sven.hans...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> I work as a data scientist within banking but will switch over to
>>> manufacturing the next month. I would like to contribute to Beam and
>>> especially the Python SDK. Could you add me as a contributor?
>>>
>>> I am new to open source contribution so feel free to give me any advice
>>> or point me in the right direction. Plan to start off with some of the
>>> starter tasks from the Jira board.
>>>
>>> Best regards
>>> Niklas
>>>
>>


Deprecating Avro for fastavro on Python 3

2019-03-27 Thread Robbe Sneyders
Hi all,

We're looking at fixing avroio on Python 3, which still fails due to a
non-picklable schema class in Avro [1]. This is fixed when using the latest
Avro master, but the last release dates back to May 2017.

Fastavro does not have the same problem, but is currently also failing due
to a dependency of avroio on Avro for schema parsing.

We would therefore propose to (temporarily?) deprecate Avro on Python 3,
and implement a pure fastavro solution instead. +Frederik Bode
  already submitted a PR for this [2].

Use of fastavro is currently activated with the `use_fastavro` flag, which
defaults to False. Since this flag would not make sense anymore on Python
3, we would like to switch the default value to True. The documentation
already mentions that this will probably become the default in the long
term, but this change would also impact Python 2. Is this a problem?
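
For reference, this is roughly how the flag is used today; the proposal would
flip the default so the explicit argument becomes unnecessary (the path is
illustrative):

    import apache_beam as beam
    from apache_beam.io.avroio import ReadFromAvro

    with beam.Pipeline() as p:
        records = (p
                   | ReadFromAvro('gs://my-bucket/data-*.avro',
                                  use_fastavro=True)
                   | beam.Map(print))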

Also, looking at the performance gain of fastavro, is there any reason to
not deprecate Avro in favor of fastavro on Python 3 indefinitely?

[1] https://issues.apache.org/jira/browse/BEAM-6522#comment-16784499
[2] https://github.com/apache/beam/pull/8130

Kind regards,
Robbe


Re: New contributor

2019-03-27 Thread Mark Liu
Welcome!

Mark

On Wed, Mar 27, 2019 at 10:09 AM Lukasz Cwik  wrote:

> Welcome. The getting started[1] and contribution guides[2] are most
> useful. I have also added you as a contributor to the JIRA project.
>
> 1: https://beam.apache.org/get-started/beam-overview/
> 2: https://beam.apache.org/contribute/
>
> On Wed, Mar 27, 2019 at 9:38 AM Niklas Hansson <
> niklas.sven.hans...@gmail.com> wrote:
>
>> Hi!
>>
>> I work as a data scientist within banking but will switch over to
>> manufacturing the next month. I would like to contribute to Beam and
>> especially the Python SDK. Could you add me as a contributor?
>>
>> I am new to open source contribution so feel free to give me any advice
>> or point me in the right direction. Plan to start off with some of the
>> starter tasks from the Jira board.
>>
>> Best regards
>> Niklas
>>
>


[PROPOSAL] Standardize Gradle structure in Python SDK

2019-03-27 Thread Mark Liu
Hi Python SDK Developers,

You may notice that Gradle files changed a lot recently as parallelization
was applied to Python tests and more Python versions were enabled in testing.
There are tricks scattered over the build scripts, and tests have grown
organically and are distributed under sdks/python, which has caused friction
(like the rollback of PR-8059
).

Thus, I created BEAM-6907  and
would like to initiate some work to clean up and standardize the Gradle
structure in the Python SDK. In general, I think we want to:

- Apply parallel execution
- Share common tasks
- Centralize test related tasks
- Have a clear Gradle structure for projects/tasks

This is the Gradle directory structure I propose:

sdks/python/
    build.gradle        --> holds build, snapshot, and analytic tasks
    test-suites/        --> all pre/post/VR test suites under here
        README.md
        dataflow/       --> grouped by runner or unit test (tox)
            py27/       --> grouped by py version
                build.gradle
            py35/
            ...
        direct/
            py27/
            ...
        flink/
        tox/
        ...


The ideas are:
- Only keep builds, snapshot and analytic jobs in sdks/python/build.gradle
- Move all test related tasks to sdks/python/test-suites/
- In sdks/python/test-suites, we first group by runner, unit test (tox), or
other testing that doesn't fit those categories, and then by Python version if
needed.
- An example of ../test-suites/../py35/build.gradle is this

.

Please feel free to explore the existing Gradle scripts in the Python SDK and
bring up any thoughts on this proposal if you have them.

Thanks!
Mark


Re: New contributor

2019-03-27 Thread Lukasz Cwik
Welcome. The getting started[1] and contribution guides[2] are most useful.
I have also added you as a contributor to the JIRA project.

1: https://beam.apache.org/get-started/beam-overview/
2: https://beam.apache.org/contribute/

On Wed, Mar 27, 2019 at 9:38 AM Niklas Hansson <
niklas.sven.hans...@gmail.com> wrote:

> Hi!
>
> I work as a data scientist within banking but will switch over to
> manufacturing the next month. I would like to contribute to Beam and
> especially the Python SDK. Could you add me as a contributor?
>
> I am new to open source contribution so feel free to give me any advice or
> point me in the right direction. Plan to start off with some of the starter
> tasks from the Jira board.
>
> Best regards
> Niklas
>


Re: New contributor

2019-03-27 Thread Ahmet Altay
Welcome!

Your user name was already added to JIRA.

On Wed, Mar 27, 2019 at 9:38 AM Niklas Hansson <
niklas.sven.hans...@gmail.com> wrote:

> Hi!
>
> I work as a data scientist within banking but will switch over to
> manufacturing the next month. I would like to contribute to Beam and
> especially the Python SDK. Could you add me as a contributor?
>
> I am new to open source contribution so feel free to give me any advice or
> point me in the right direction. Plan to start off with some of the starter
> tasks from the Jira board.
>
> Best regards
> Niklas
>


New contributor

2019-03-27 Thread Niklas Hansson
Hi!

I work as a data scientist within banking but will switch over to
manufacturing the next month. I would like to contribute to Beam and
especially the Python SDK. Could you add me as a contributor?

I am new to open source contribution so feel free to give me any advice or
point me in the right direction. Plan to start off with some of the starter
tasks from the Jira board.

Best regards
Niklas


Re: New contributor

2019-03-27 Thread Guobao Li
Thank you all you guys!

On Tue, Mar 26, 2019 at 7:20 PM Connell O'Callaghan 
wrote:

> Welcome Guobao!!!
>
> On Tue, Mar 26, 2019 at 11:09 AM Melissa Pashniak 
> wrote:
>
>> Welcome!
>>
>>
>> On Tue, Mar 26, 2019 at 10:17 AM Kenneth Knowles  wrote:
>>
>>> Welcome! Cool project. A lot of code, and thorough experiments.
>>>
>>> Kenn
>>>
>>> On Tue, Mar 26, 2019 at 9:15 AM Chamikara Jayalath 
>>> wrote:
>>>
 Welcome!

 On Tue, Mar 26, 2019 at 8:56 AM Ahmet Altay  wrote:

> Welcome Guobao!
>
> On Tue, Mar 26, 2019 at 7:13 AM Ismaël Mejía 
> wrote:
>
>> Welcome Guobao!
>>
>> Nice that you are joining us. Looking forward to your contributions!
>> Take the time to read the contribution guide
>> https://beam.apache.org/contribute/ and don't hesitate to ask any
>> question you may have.
>>
>> On Tue, Mar 26, 2019 at 2:14 PM Alexey Romanenko
>>  wrote:
>> >
>> > Welcome, Guobao! Great to have you on board!
>> >
>> > Alexey
>> >
>> > On 26 Mar 2019, at 11:49, Guobao Li 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I am Guobao Li from Talend. I am new to Apache Beam. Currently I am
>> working on the implementation of CouchbaseIO and hope to contribute in 
>> more
>> areas in the future.
>> >
>> > I have some open source experience. I contributed an extension of a
>> DSL to introduce the architecture of Parameter Server to Apache SystemML
>> [1] as part of GSoC 2018. I earned committership through that work. So I’m 
>> very
>> glad to be here and continue contributing to Open Source now on Apache 
>> Beam.
>> >
>> > Regards,
>> > Guobao Li
>> >
>> > [1]
>> https://summerofcode.withgoogle.com/archive/2018/projects/5148916517437440/
>> >
>> >
>>
>