Re: [DISCUSSION] Docker based development environment issue

2021-05-21 Thread Gleb Kanterov
Is it possible to mount the Docker socket inside the build-env Docker
container? We run a lot of similar tests in CI, and it always worked:

--mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock
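
For example, a minimal sketch of how the build-env container could be started
with the socket mounted (the image name and Gradle task here are placeholders,
not taken from start-build-env.sh):

docker run -it \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  beam-build-env:latest \
  ./gradlew :sdks:java:io:clickhouse:test

With the socket mounted, containers started by the tests run as siblings on
the host Docker daemon instead of being nested inside the build container.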

On Fri, May 21, 2021 at 12:26 PM Alexey Romanenko 
wrote:

> Hello,
>
> Beam provides a very cool feature to run a local development environment
> via Docker [1]. At the same time, some unit tests need to run Docker
> containers to test against “real” instances (for example,
> ClickHouseIOTest). So it ends up with a “docker-in-docker” issue, and
> such tests will fail.
>
> What would be a proper solution for that? Annotate these tests with a
> specific “DockerImageRequired” annotation and skip them when running from
> inside a container, or something else? Any ideas on this?
>
> Thanks,
> Alexey
>
>
> [1] https://github.com/apache/beam/blob/master/start-build-env.sh


Re: [ANNOUNCE] New PMC Member: Chamikara Jayalath

2021-01-22 Thread Gleb Kanterov
Congratulations!

On Fri, Jan 22, 2021 at 9:29 AM Ismaël Mejía  wrote:

> Congrats Cham, well deserved!
>
>
> On Fri, Jan 22, 2021 at 9:02 AM Michał Walenia 
> wrote:
>
>> Congratulations, Cham! Thanks for your work!
>>
>>
>> On Fri, Jan 22, 2021 at 3:13 AM Charles Chen  wrote:
>>
>>> Congrats Cham!
>>>
>>> On Thu, Jan 21, 2021, 5:39 PM Chamikara Jayalath 
>>> wrote:
>>>
 Thanks everybody :)

 - Cham

 On Thu, Jan 21, 2021 at 5:22 PM Pablo Estrada 
 wrote:

> Yoohoo Cham : )
>
> On Thu, Jan 21, 2021 at 5:20 PM Udi Meiri  wrote:
>
>> Congrats Cham!
>>
>> On Thu, Jan 21, 2021 at 4:25 PM Griselda Cuevas 
>> wrote:
>>
>>> Congratulations Cham!!! Well deserved :)
>>>
>>> On Thu, 21 Jan 2021 at 15:23, Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Well done Cham!!! Thank you for all your contributions to date!!!


 On Thu, Jan 21, 2021 at 3:18 PM Rui Wang  wrote:

> Congratulations, Cham!
>
> -Rui
>
> On Thu, Jan 21, 2021 at 3:15 PM Robert Bradshaw <
> rober...@google.com> wrote:
>
>> Congratulations, Cham!
>>
>> On Thu, Jan 21, 2021 at 3:13 PM Brian Hulette <
>> bhule...@google.com> wrote:
>>
>>> Great news, congratulations Cham!
>>>
>>> On Thu, Jan 21, 2021 at 3:08 PM Robin Qiu 
>>> wrote:
>>>
 Congratulations, Cham!

 On Thu, Jan 21, 2021 at 3:05 PM Tyson Hamilton <
 tyso...@google.com> wrote:

> Woo! Congrats Cham!
>
> On Thu, Jan 21, 2021 at 3:02 PM Robert Burke <
> rob...@frantil.com> wrote:
>
>> Congratulations! That's fantastic news.
>>
>> On Thu, Jan 21, 2021, 2:59 PM Reza Rokni 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Fri, Jan 22, 2021 at 6:58 AM Ankur Goenka <
>>> goe...@google.com> wrote:
>>>
 Congrats Cham!

 On Thu, Jan 21, 2021 at 2:57 PM Ahmet Altay <
 al...@google.com> wrote:

> Hi all,
>
> Please join me and the rest of Beam PMC in welcoming
> Chamikara Jayalath as our
> newest PMC member.
>
> Cham has been part of the Beam community from its early
> days and contributed to the project in significant ways, including
> contributing new features and improvements, especially related to Beam IOs,
> advocating for users, and mentoring new community members.
>
> Congratulations Cham! And thanks for being a part of Beam!
>
> Ahmet
>

>>
>> --
>>
>> Michał Walenia
>> Polidea  | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! 
>>
>


Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-17 Thread Gleb Kanterov
Congratulations!

On Tue, Sep 15, 2020 at 5:44 PM Ismaël Mejía  wrote:

> Congratulations Reza, well done !
>
> On Mon, Sep 14, 2020 at 10:10 AM Katarzyna Kucharczyk
>  wrote:
> >
> > Congratulations Reza! :)
> >
> > On Mon, Sep 14, 2020 at 10:05 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>
> >> Congratulations! Thanks Reza for your contributions!
> >>
> >> On 12 Sep 2020, at 10:00, Jan Lukavský  wrote:
> >>
> >> Congrats Reza!
> >>
> >> On 9/12/20 5:29 AM, Thomas Weise wrote:
> >>
> >> Congratulations!
> >>
> >>
> >> On Fri, Sep 11, 2020, 5:54 PM Andrew Psaltis 
> wrote:
> >>>
> >>> Congrats!
> >>>
> >>> On Sat, Sep 12, 2020 at 7:43 AM Reza Rokni  wrote:
> 
>  Thanx everyone! Looking forward to being able to contribute more :-)
> 
>  On Sat, Sep 12, 2020 at 4:33 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> >
> > Congrats!
> >
> > On Thu, Sep 10, 2020 at 8:08 PM Connell O'Callaghan <
> conne...@google.com> wrote:
> >>
> >> Excellent- well done Reza!!!
> >>
> >> On Thu, Sep 10, 2020 at 7:35 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
> >>>
> >>> Thanks and congrats, Reza!
> >>>
> >>> On Thu, Sep 10, 2020 at 5:48 PM Heejong Lee 
> wrote:
> 
>  Congratulations!
> 
>  On Thu, Sep 10, 2020 at 4:42 PM Robert Bradshaw <
> rober...@google.com> wrote:
> >
> > Thank you and welcome, Reza!
> >
> > On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay 
> wrote:
> >>
> >> Congratulations Reza! And thank you for your contributions!
> >>
> >> On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>>
> >>> Congrats Reza!
> >>>
> >>> On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles <
> k...@apache.org> wrote:
> 
>  Hi all,
> 
>  Please join me and the rest of the Beam PMC in welcoming a
> new committer: Reza Ardeshir Rokni.
> 
>  Reza has been part of the Beam community since 2017! Reza has
> spearheaded advanced Beam examples [1], blogged and presented at multiple
> Beam Summits. Reza helps out users on the mailing lists [2] and
> StackOverflow [3]. When Reza's work uncovers a missing feature in Beam, he
> adds it [4]. Considering these contributions, the Beam PMC trusts Reza with
> the responsibilities of a Beam committer [5].
> 
>  Thank you, Reza, for your contributions.
> 
>  Kenn
> 
>  [1] https://github.com/apache/beam/pull/3961
>  [2]
> https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
>  [3] https://stackoverflow.com/tags/apache-beam/topusers
>  [4] https://github.com/apache/beam/pull/11929
>  [5]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> 
> 
> >>>
> >>>
> >>
> >>
> >
> >
> 
> 
> >>>
> >>>
> >> --
> >> Your feedback welcomed for Connello!!!
> >>
> >>
>


Re: Percentile metrics in Beam

2020-08-18 Thread Gleb Kanterov
Hi Alex,

I'm not sure about restoring a histogram, because the use case I had in the
past used percentiles. As I understand it, you can approximate a histogram if
you know percentiles and the total count. E.g. 5% of values fall into the
[P95, +INF) bucket, another 5% into [P90, P95), etc. I don't understand the
paper well enough to say how it's going to work if given bucket boundaries
happen to include a small number of values. I guess it's a similar kind of
trade-off to the one we face when choosing boundaries if we want to get
percentiles from histogram buckets. I primarily see the moment sketch as a
method intended to approximate percentiles, not histogram buckets.
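
To make the approximation concrete, here is a minimal sketch (a hypothetical
helper, not Beam code; `quantile` stands for whatever the sketch solver
exposes, assumed monotone over [0, 1]):

import java.util.function.DoubleUnaryOperator;

class HistogramFromQuantiles {
  // Approximate bucket counts given a quantile function and a total count.
  // Inverts quantile() by binary search to get CDF(boundary), then takes
  // differences. Rounding means counts may not sum exactly to totalCount.
  static long[] bucketCounts(
      DoubleUnaryOperator quantile, long totalCount, double[] boundaries) {
    long[] counts = new long[boundaries.length + 1];
    double prevP = 0;
    for (int i = 0; i < boundaries.length; i++) {
      double lo = prevP, hi = 1;
      for (int step = 0; step < 50; step++) {
        double mid = (lo + hi) / 2;
        if (quantile.applyAsDouble(mid) < boundaries[i]) {
          lo = mid;
        } else {
          hi = mid;
        }
      }
      counts[i] = Math.round((lo - prevP) * totalCount);
      prevP = lo;
    }
    counts[boundaries.length] = totalCount - Math.round(prevP * totalCount);
    return counts;
  }
}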

/Gleb

On Tue, Aug 18, 2020 at 2:13 AM Alex Amato  wrote:

> Hi Gleb, and Luke
>
> I was reading through the paper, blog and github you linked to. One thing
> I can't figure out is if it's possible to use the Moment Sketch to restore
> an original histogram.
> Given bucket boundaries: b0, b1, b2, b3, ...
> Can we obtain the counts for the number of values inserted into each of
> the ranges: [-INF, B0), … [Bi, Bi+1), …
> (This is a requirement I need)
>
> Not to be confused with the percentile/threshold based queries discussed
> in the blog.
>
> Luke, were you suggesting collecting both and sending both over the FN API
> wire? I.e. collecting both
>
>    - the variables to represent the Histogram, as suggested in
>      https://s.apache.org/beam-histogram-metrics
>    - in addition to the moment sketch variables
>      <https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/>
>
> I believe that would be feasible, as we would still retain the Histogram
> data. I don't think we can restore the Histograms with just the Sketch, if
> that was the suggestion. Please let me know if I misunderstood.
>
> If that's correct, I can write up the benefits and drawbacks I see for
> both approaches.
>
>
> On Mon, Aug 17, 2020 at 9:23 AM Luke Cwik  wrote:
>
>> That is an interesting suggestion to change to use a sketch.
>>
>> I believe having one metric URN that represents all this information
>> grouped together would make sense instead of attempting to aggregate
>> several metrics together. The underlying implementation of using
>> sum/count/max/min would stay the same but we would want a single object
>> that abstracts this complexity away for users as well.
>>
>> On Mon, Aug 17, 2020 at 3:42 AM Gleb Kanterov  wrote:
>>
>>> I didn't see the proposal by Alex before today. I want to add a few more
>>> cents from my side.
>>>
>>> There is a paper, Moment-based quantile sketches for efficient high
>>> cardinality aggregation queries [1]. The TL;DR is that for some N (around
>>> 10-20, depending on accuracy) we need to collect SUM(log^N(X)) ...
>>> SUM(log(X)), COUNT(X), SUM(X), SUM(X^2) ... SUM(X^N), MAX(X), MIN(X).
>>> Given these aggregated numbers, it uses a solver for Chebyshev polynomials
>>> to get quantile estimates, and there is already a Java implementation of
>>> it on GitHub [2].
>>>
>>> This way we can express quantiles using existing metric types in Beam,
>>> which can already be done without SDK or runner changes. It can fit nicely
>>> into existing runners and can be abstracted over if needed. I think this
>>> is also one of the best implementations: it has a < 1% error rate for 200
>>> bytes of storage, and it is quite efficient to compute. Did we consider
>>> using that?
>>>
>>> [1]:
>>> https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
>>> [2]: https://github.com/stanford-futuredata/msketch
>>>
>>> On Sat, Aug 15, 2020 at 6:15 AM Alex Amato  wrote:
>>>
>>>> The distinction here is that even though these metrics come from user
>>>> space, we still gave them specific URNs, which imply they have a specific
>>>> format, with specific labels, etc.
>>>>
>>>> That is, we won't be packaging them into a USER_HISTOGRAM URN. That URN
>>>> would have fewer expectations for its format. Today the USER_COUNTER just
>>>> expects labels like (TRANSFORM, NAME, NAMESPACE).
>>>>
>>>> We didn't decide on making a private API, but rather an API
>>>> available to user code for populating metrics with specific labels and
>>>> specific URNs. The same API could pretty much be used for a user
>>>> USER_HISTOGRAM, with a default URN chosen.
>>>> That's how I see it in my head at the moment.
>>>>
>>>>
>>>> On Fri, Aug 14, 2020 at 8:52 PM Robert Bradshaw 
>>>

Re: Percentile metrics in Beam

2020-08-17 Thread Gleb Kanterov
I didn't see the proposal by Alex before today. I want to add a few more
cents from my side.

There is a paper, Moment-based quantile sketches for efficient high
cardinality aggregation queries [1]. The TL;DR is that for some N (around
10-20, depending on accuracy) we need to collect SUM(log^N(X)) ...
SUM(log(X)), COUNT(X), SUM(X), SUM(X^2) ... SUM(X^N), MAX(X), MIN(X). Given
these aggregated numbers, it uses a solver for Chebyshev polynomials to get
quantile estimates, and there is already a Java implementation of it on
GitHub [2].

This way we can express quantiles using existing metric types in Beam, which
can already be done without SDK or runner changes. It can fit nicely into
existing runners and can be abstracted over if needed. I think this is also
one of the best implementations: it has a < 1% error rate for 200 bytes of
storage, and it is quite efficient to compute. Did we consider using that?

[1]:
https://blog.acolyer.org/2018/10/31/moment-based-quantile-sketches-for-efficient-high-cardinality-aggregation-queries/
[2]: https://github.com/stanford-futuredata/msketch
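
For illustration, a minimal sketch of the statistics collection described
above (an assumption-laden example, not Beam code: N is set to 4 for brevity
where the paper suggests around 10-20, and values are assumed positive so the
log moments are defined):

// Each field is a plain sum/count/min/max, so it maps onto existing
// counter-style metric types and merges element-wise across workers.
class MomentSketchAccumulator {
  static final int N = 4;
  long count;
  double min = Double.POSITIVE_INFINITY;
  double max = Double.NEGATIVE_INFINITY;
  double[] powerSums = new double[N + 1];     // SUM(X^0) .. SUM(X^N)
  double[] logPowerSums = new double[N + 1];  // SUM(log^0(X)) .. SUM(log^N(X))

  void add(double x) {
    count++;
    min = Math.min(min, x);
    max = Math.max(max, x);
    double p = 1, lp = 1, logX = Math.log(x);
    for (int k = 0; k <= N; k++) {
      powerSums[k] += p;
      logPowerSums[k] += lp;
      p *= x;
      lp *= logX;
    }
  }

  void merge(MomentSketchAccumulator other) {
    count += other.count;
    min = Math.min(min, other.min);
    max = Math.max(max, other.max);
    for (int k = 0; k <= N; k++) {
      powerSums[k] += other.powerSums[k];
      logPowerSums[k] += other.logPowerSums[k];
    }
  }
}

Recovering the actual quantile estimates from these sums is what the solver
in [2] does.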

On Sat, Aug 15, 2020 at 6:15 AM Alex Amato  wrote:

> The distinction here is that even though these metrics come from user
> space, we still gave them specific URNs, which imply they have a specific
> format, with specific labels, etc.
>
> That is, we won't be packaging them into a USER_HISTOGRAM URN. That URN
> would have fewer expectations for its format. Today the USER_COUNTER just
> expects labels like (TRANSFORM, NAME, NAMESPACE).
>
> We didn't decide on making a private API, but rather an API available to
> user code for populating metrics with specific labels and specific URNs.
> The same API could pretty much be used for a user USER_HISTOGRAM, with a
> default URN chosen.
> That's how I see it in my head at the moment.
>
>
> On Fri, Aug 14, 2020 at 8:52 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Aug 14, 2020 at 7:35 PM Alex Amato  wrote:
>> >
>> > I am only tackling the specific metrics covered in (for the python SDK
>> first, then Java). To collect latency of IO API RPCs, and store it in a
>> histogram.
>> > https://s.apache.org/beam-gcp-debuggability
>> >
>> > User histogram metrics are unfunded, as far as I know. But you should
>> be able to extend what I do for that project to the user metric use case. I
>> agree, it won't be much more work to support that. I designed the histogram
>> with the user histogram case in mind.
>>
>> From the portability point of view, all metrics generated in users
>> code (and SDK-side IOs are "user code") are user metrics. But
>> regardless of how things are named, once we have histogram metrics
>> crossing the FnAPI boundary all the infrastructure will be in place.
>> (At least the plan as I understand it shouldn't use private APIs
>> accessible only by the various IOs but not other SDK-level code.)
>>
>> > On Fri, Aug 14, 2020 at 5:47 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> Once histograms are implemented in the SDK(s) (Alex, you're tackling
>> >> this, right?) it shouldn't be much work to update the Samza worker code
>> >> to publish these via the Samza runner APIs (in parallel with Alex's
>> >> work to do the same on Dataflow).
>> >>
>> >> On Fri, Aug 14, 2020 at 5:35 PM Alex Amato  wrote:
>> >> >
>> >> > No one has any plans currently to work on adding a generic histogram
>> metric, at the moment.
>> >> >
>> >> > But I will be actively working on adding it for a specific set of
>> metrics in the next quarter or so
>> >> > https://s.apache.org/beam-gcp-debuggability
>> >> >
>> >> > After that work, one could take a look at my PRs for reference to
>> create new metrics using the same histogram. One may wish to implement the
>> UserHistogram use case and use that in the Samza Runner
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Aug 14, 2020 at 5:25 PM Ke Wu  wrote:
>> >> >>
>> >> >> Thank you Robert and Alex. I am not running a Beam job in Google
>> Cloud but with the Samza Runner, so I am wondering if there is any ETA for
>> adding Histogram metrics to the Metrics class so they can be mapped to the
>> SamzaHistogram metric for the actual emitting.
>> >> >>
>> >> >> Best,
>> >> >> Ke
>> >> >>
>> >> >> On Aug 14, 2020, at 4:44 PM, Alex Amato  wrote:
>> >> >>
>> >> >> One of the plans to use the histogram data is to send it to Google
>> Monitoring to compute estimates of percentiles. This is done using the
>> bucket counts and bucket boundaries.
>> >> >>
>> >> >> Here is a description of roughly how it's calculated.
>> >> >>
>> https://stackoverflow.com/questions/59635115/gcp-console-how-are-percentile-charts-calculated
>> >> >> This is a non-exact estimate, but plotting the estimated
>> percentiles over time is often easier to understand and sufficient.
>> >> >> (An alternative is a heatmap chart representing histograms over
>> time. I.e. a histogram for each window of time).
>> >> >>
>> >> >>
>> >> >> On Fri, Aug 14, 2020 at 4:16 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >> >>>
>> >> >>> You may be interested in the propose 

Re: Chronically flaky tests

2020-07-16 Thread Gleb Kanterov
There is something called test-retry-gradle-plugin [1]. It retries tests if
they fail and has different modes for handling flaky tests. Did we ever try
or consider using it?

[1]: https://github.com/gradle/test-retry-gradle-plugin
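
For reference, enabling it per Gradle test task looks roughly like this (a
sketch based on the plugin's documented DSL; the version number is a
placeholder):

plugins {
  id "org.gradle.test-retry" version "1.1.4"
}

test {
  retry {
    maxRetries = 3                  // retry each failed test up to 3 times
    maxFailures = 10                // stop retrying when a run is broadly broken
    failOnPassedAfterRetry = false  // a pass on retry still passes the build
  }
}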

On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov  wrote:

> I agree with what Ahmet is saying. I can share my perspective: recently I
> had to retrigger a build 6 times due to flaky tests, and each retrigger took
> one hour of waiting time.
>
> I've seen examples of automatic tracking of flaky tests, where a test is
> considered flaky if it both fails and succeeds for the same git SHA. Not
> sure if there is anything we can enable to get this automatically.
>
> /Gleb
>
> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay  wrote:
>
>> I think it would be reasonable to disable/sickbay any flaky test that is
>> actively blocking people. The collective cost of flaky tests for such a
>> large group of contributors is very significant.
>>
>> Most of these issues are unassigned. IMO, it makes sense to assign these
>> issues to the most relevant person (who added the test/who generally
>> maintains those components). Those people can either fix and re-enable the
>> tests, or remove them if they no longer provide valuable signals.
>>
>> Ahmet
>>
>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles  wrote:
>>
>>> The situation is much worse than that IMO. My experience of the last few
>>> days is that a large portion of time went to *just connecting failing runs
>>> with the corresponding Jira tickets or filing new ones*.
>>>
>>> Summarized on PRs:
>>>
>>>  - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>  - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>  - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>  - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>  - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>
>>> The tickets:
>>>
>>>  - https://issues.apache.org/jira/browse/BEAM-10460
>>> SparkPortableExecutionTest
>>>  - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest >
>>> testEstimatedSizeBytes
>>>  - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest
>>> > testWriteFullAddressing and testWriteWithIndexFn
>>>  - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>  - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest
>>> > @BeforeClass (classmethod)
>>>  - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>  - https://issues.apache.org/jira/browse/BEAM-10506
>>> SplunkEventWriterTest
>>>  - https://issues.apache.org/jira/browse/BEAM-10472 direct runner
>>> ParDoLifecycleTest
>>>  - https://issues.apache.org/jira/browse/BEAM-9187
>>> DefaultJobBundleFactoryTest
>>>
>>> Here are our P1 test flake bugs:
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>
>>> It seems quite a few of them are actively hindering people right now.
>>>
>>> Kenn
>>>
>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud 
>>> wrote:
>>>
>>>> We have two test suites that are responsible for a large percentage of
>>>> our flaky tests, and both have bugs open for about a year without being
>>>> fixed. These suites are ParDoLifecycleTest (BEAM-8101
>>>> <https://issues.apache.org/jira/browse/BEAM-8101>) in Java
>>>> and BigQueryWriteIntegrationTests in python (py3 BEAM-9484
>>>> <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232
>>>> <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate
>>>> BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>
>>>> Are there any volunteers to look into these issues? What can we do to
>>>> mitigate the flakiness until someone has time to investigate?
>>>>
>>>> Andrew
>>>>
>>>


Re: Chronically flaky tests

2020-07-16 Thread Gleb Kanterov
I agree with what Ahmet is saying. I can share my perspective: recently I
had to retrigger a build 6 times due to flaky tests, and each retrigger took
one hour of waiting time.

I've seen examples of automatic tracking of flaky tests, where a test is
considered flaky if it both fails and succeeds for the same git SHA. Not
sure if there is anything we can enable to get this automatically.

/Gleb

On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay  wrote:

> I think it would be reasonable to disable/sickbay any flaky test that is
> actively blocking people. The collective cost of flaky tests for such a
> large group of contributors is very significant.
>
> Most of these issues are unassigned. IMO, it makes sense to assign these
> issues to the most relevant person (who added the test/who generally
> maintains those components). Those people can either fix and re-enable the
> tests, or remove them if they no longer provide valuable signals.
>
> Ahmet
>
> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles  wrote:
>
>> The situation is much worse than that IMO. My experience of the last few
>> days is that a large portion of time went to *just connecting failing runs
>> with the corresponding Jira tickets or filing new ones*.
>>
>> Summarized on PRs:
>>
>>  - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>  - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>  - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>  - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>  - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>  - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>  - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>  - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>
>> The tickets:
>>
>>  - https://issues.apache.org/jira/browse/BEAM-10460
>> SparkPortableExecutionTest
>>  - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest >
>> testEstimatedSizeBytes
>>  - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest
>> > testWriteFullAddressing and testWriteWithIndexFn
>>  - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>  - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest
>> > @BeforeClass (classmethod)
>>  - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>  - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>  - https://issues.apache.org/jira/browse/BEAM-10472 direct runner
>> ParDoLifecycleTest
>>  - https://issues.apache.org/jira/browse/BEAM-9187
>> DefaultJobBundleFactoryTest
>>
>> Here are our P1 test flake bugs:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>
>> It seems quite a few of them are actively hindering people right now.
>>
>> Kenn
>>
>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud 
>> wrote:
>>
>>> We have two test suites that are responsible for a large percentage of
>>> our flaky tests, and both have bugs open for about a year without being
>>> fixed. These suites are ParDoLifecycleTest (BEAM-8101
>>> ) in Java
>>> and BigQueryWriteIntegrationTests in python (py3 BEAM-9484
>>> , py2 BEAM-9232
>>> , old duplicate
>>> BEAM-8197 ).
>>>
>>> Are there any volunteers to look into these issues? What can we do to
>>> mitigate the flakiness until someone has time to investigate?
>>>
>>> Andrew
>>>
>>


Re: [ANNOUNCE] New PMC Member: Alexey Romanenko

2020-06-17 Thread Gleb Kanterov
Congratulations! Thanks for your hard work!

On Wed, Jun 17, 2020 at 1:11 PM Alexey Romanenko 
wrote:

> Thank you Ismaël and everybody!
> Happy to be a part of Beam community!
>
> On 17 Jun 2020, at 09:31, Jan Lukavský  wrote:
>
> Congrats Alexey!
> On 6/17/20 9:22 AM, Reza Rokni wrote:
>
> Congratulations!
>
> On Wed, Jun 17, 2020 at 2:48 PM Michał Walenia 
> wrote:
>
>> Congratulations!
>>
>> On Tue, Jun 16, 2020 at 11:45 PM Rui Wang  wrote:
>>
>>> Congrats!
>>>
>>>
>>> -Rui
>>>
>>> On Tue, Jun 16, 2020 at 2:42 PM Ankur Goenka  wrote:
>>>
 Congratulations Alexey!

 On Tue, Jun 16, 2020 at 2:41 PM Thomas Weise  wrote:

> Congratulations!
>
>
> On Tue, Jun 16, 2020 at 1:27 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congratulations!
>>
>> On Tue, Jun 16, 2020 at 11:41 AM Ahmet Altay 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, Jun 16, 2020 at 10:05 AM Pablo Estrada 
>>> wrote:
>>>
 Yooohooo! Thanks for all your contributions and hard work Alexey!:)

 On Tue, Jun 16, 2020, 8:57 AM Ismaël Mejía 
 wrote:

> Please join me and the rest of Beam PMC in welcoming Alexey
> Romanenko as our
> newest PMC member.
>
> Alexey has significantly contributed to the project in different ways: new
> features and improvements in the Spark runner(s), as well as maintenance of
> multiple IO connectors, including some of our most used ones (Kafka and
> Kinesis/AWS). Alexey is also quite active helping new contributors and our
> user community on the mailing lists / Slack and Stack Overflow.
>
> Congratulations Alexey!  And thanks for being a part of Beam!
>
> Ismaël
>

>>
>> --
>> Michał Walenia
>> Polidea  | Software Engineer
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>> Unique Tech
>> Check out our projects! 
>>
>
>


Re: [ANNOUNCE] New committer: Robin Qiu

2020-05-19 Thread Gleb Kanterov
Congratulations!

On Tue, May 19, 2020 at 7:31 AM Aizhamal Nurmamat kyzy 
wrote:

> Congratulations, Robin! Thank you for your contributions!
>
> On Mon, May 18, 2020, 7:18 PM Boyuan Zhang  wrote:
>
>> Congrats~~
>>
>> On Mon, May 18, 2020 at 7:17 PM Reza Rokni  wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, May 19, 2020 at 10:06 AM Ahmet Altay  wrote:
>>>
 Hi everyone,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Robin Qiu.

 Robin has been active in the community for close to 2 years, worked
 on HyperLogLog++ [1], SQL [2], improved documentation, and helped with
 releases(*).

 In consideration of his contributions, the Beam PMC trusts him with the
 responsibilities of a Beam committer [3].

 Thank you for your contributions Robin!

 -Ahmet, on behalf of the Apache Beam PMC

 [1] https://www.meetup.com/Zurich-Apache-Beam-Meetup/events/265529665/
 [2] https://www.meetup.com/Belgium-Apache-Beam-Meetup/events/264933301/
 [3] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
 (*) And maybe he will be a release manager soon :)




Re: /zetasql/local_service/liblocal_service_jni.jnilib was not found inside JAR

2020-03-26 Thread Gleb Kanterov
Great work. I've tried these jars, and it works on my end.

/Gleb

On Wed, Mar 25, 2020 at 7:58 PM Andrew Pilloud  wrote:

> This should be fixed in ZetaSQL 2020.03.2, which will be coming out soon.
> We've verified it on multiple machines with OS X 10.14.6. If you have
> another version, I could use help testing.
>
> Check out this PR: https://github.com/apache/beam/pull/11223
> Run: ./gradlew :sdks:java:extensions:sql:zetasql:check
>
> Thanks!
>
> Andrew
>
> On Tue, Mar 17, 2020 at 2:50 AM Gleb Kanterov  wrote:
>
>> There is a branch that builds ZetaSQL on Mac; it only works with Bazel
>> 0.25.3. You would need Xcode to build it locally. After you build it and
>> put the jnilib into the classpath, it just works. One of my colleagues has
>> updated this branch to the latest release [1].
>>
>> /Gleb
>>
>> [1]: https://github.com/csenel/zetasql/tree/darwin-build
>>
>> On Mon, Mar 16, 2020 at 8:26 PM Tomo Suzuki  wrote:
>>
>>> I see. Thank you. I also found Alex's ticket for zetasql.
>>> https://github.com/google/zetasql/issues/25
>>>
>>> Closing thread.
>>>
>>> Regards,
>>> Tomo
>>>
>>> On Mon, Mar 16, 2020 at 3:21 PM Andrew Pilloud 
>>> wrote:
>>>
>>>> That error is expected unless you've built your own jar with
>>>> liblocal_service_jni.jnilib for OS X. A few have tried, but no one has
>>>> succeeded (as far as I know); it is on the ZetaSQL team's todo list.
>>>> You'll need to run that module on Linux for now.
>>>>
>>>> See: https://github.com/google/zetasql/pull/3
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Mar 16, 2020 at 12:09 PM Tomo Suzuki 
>>>> wrote:
>>>>
>>>>> Hi Beam developers,
>>>>>
>>>>> I started getting test failures when building Beam on my MacBook Pro.
>>>>> Module: sdks/java/extensions/sql/zetasql. The NoClassDefFoundError occurs
>>>>> because the jnilib file is missing.
>>>>>
>>>>> Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
>>>>> File /zetasql/local_service/liblocal_service_jni.jnilib was not found
>>>>> inside JAR.
>>>>> at
>>>>> com.google.zetasql.JniChannelProvider.<init>(JniChannelProvider.java:68)
>>>>> ... 69 more
>>>>> Caused by: java.io.FileNotFoundException: File
>>>>> /zetasql/local_service/liblocal_service_jni.jnilib was not found inside 
>>>>> JAR.
>>>>> at
>>>>> com.google.zetasql.cz.adamh.utils.NativeUtils.loadLibraryFromJar(NativeUtils.java:105)
>>>>> at
>>>>> com.google.zetasql.JniChannelProvider.<init>(JniChannelProvider.java:66)
>>>>> ... 69 more
>>>>>
>>>>> Full log:
>>>>> https://gist.github.com/suztomo/f3d8815e8f48aeabd0288de34c1488f0
>>>>>
>>>>> Has anyone encountered a similar problem?
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Tomo
>>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>


Re: /zetasql/local_service/liblocal_service_jni.jnilib was not found inside JAR

2020-03-17 Thread Gleb Kanterov
There is a branch that builds ZetaSQL on Mac; it only works with Bazel
0.25.3. You would need Xcode to build it locally. After you build it and
put the jnilib into the classpath, it just works. One of my colleagues has
updated this branch to the latest release [1].

/Gleb

[1]: https://github.com/csenel/zetasql/tree/darwin-build

On Mon, Mar 16, 2020 at 8:26 PM Tomo Suzuki  wrote:

> I see. Thank you. I also found Alex's ticket for zetasql.
> https://github.com/google/zetasql/issues/25
>
> Closing thread.
>
> Regards,
> Tomo
>
> On Mon, Mar 16, 2020 at 3:21 PM Andrew Pilloud 
> wrote:
>
>> That error is expected unless you've built your own jar with
>> liblocal_service_jni.jnilib for OS X. A few have tried, but no one has
>> succeeded (as far as I know); it is on the ZetaSQL team's todo list.
>> You'll need to run that module on Linux for now.
>>
>> See: https://github.com/google/zetasql/pull/3
>>
>> Andrew
>>
>> On Mon, Mar 16, 2020 at 12:09 PM Tomo Suzuki  wrote:
>>
>>> Hi Beam developers,
>>>
>>> I started getting test failures when building Beam on my MacBook Pro.
>>> Module: sdks/java/extensions/sql/zetasql. The NoClassDefFoundError occurs
>>> because the jnilib file is missing.
>>>
>>> Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
>>> File /zetasql/local_service/liblocal_service_jni.jnilib was not found
>>> inside JAR.
>>> at
>>> com.google.zetasql.JniChannelProvider.<init>(JniChannelProvider.java:68)
>>> ... 69 more
>>> Caused by: java.io.FileNotFoundException: File
>>> /zetasql/local_service/liblocal_service_jni.jnilib was not found inside JAR.
>>> at
>>> com.google.zetasql.cz.adamh.utils.NativeUtils.loadLibraryFromJar(NativeUtils.java:105)
>>> at
>>> com.google.zetasql.JniChannelProvider.<init>(JniChannelProvider.java:66)
>>> ... 69 more
>>>
>>> Full log:
>>> https://gist.github.com/suztomo/f3d8815e8f48aeabd0288de34c1488f0
>>>
>>> Has anyone encountered a similar problem?
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>
> --
> Regards,
> Tomo
>


Re: [VOTE] Upgrade gradle to 6.2

2020-02-25 Thread Gleb Kanterov
+1 (non-binding)

On Tue, Feb 25, 2020 at 9:38 AM Ismaël Mejía  wrote:

> +1 great to have our build updated, please share if there are new
> interesting features/plugin advantages we can benefit from too.
>
> On Tue, Feb 25, 2020 at 8:24 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Alex
>>
>> I also have couple of contacts at Gradle. Let me know if needed.
>>
>> Regards
>> JB
>>
>> On Tue, Feb 25, 2020 at 8:20 AM Alex Van Boxel  wrote:
>>
>>> OK, great. I know someone that works at gradle, so I can ping them when
>>> I have some problems.
>>>
>>> Any other know pitfalls I can expect, let me know :-)
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Tue, Feb 25, 2020 at 7:20 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> +1
>>>
>>> It makes sense.
>>>
>>> Thanks.
>>> Regards
>>> JB
>>>
>>> On Mon, Feb 24, 2020 at 10:37 PM Alex Van Boxel  wrote:
>>>
>>> Any objections to me upgrading Gradle to 6.2? If OK, this will be done
>>> over several commits where I will:
>>>
>>>- Upgrade plugins
>>>- Upgrade Gradle to 6.2
>>>- See where we can use some of the new features
>>>
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>


Re: [ANNOUNCE] New committer: Chad Dombrova

2020-02-25 Thread Gleb Kanterov
Congratulations!

On Tue, Feb 25, 2020 at 9:44 AM Ismaël Mejía  wrote:

> Congratulations, so well deserved for all the amazing work and new
> perspectives you have brought into the project !!!
>
> On Tue, Feb 25, 2020 at 8:24 AM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Hooray!
>>
>> On Mon, Feb 24, 2020, 11:21 PM Alex Van Boxel  wrote:
>>
>>> Congrats!
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Tue, Feb 25, 2020 at 6:21 AM Thomas Weise  wrote:
>>>
 Congratulations!


 On Mon, Feb 24, 2020, 3:38 PM Ankur Goenka  wrote:

> Congratulations Chad!
>
> On Mon, Feb 24, 2020 at 3:34 PM Ahmet Altay  wrote:
>
>> Congratulations!
>>
>> On Mon, Feb 24, 2020 at 3:25 PM Sam Bourne  wrote:
>>
>>> Nice one Chad. Your typing efforts are very welcomed.
>>>
>>> On Tue, Feb 25, 2020 at 10:16 AM Yichi Zhang 
>>> wrote:
>>>
 Congratulations, Chad!

 On Mon, Feb 24, 2020 at 3:10 PM Robert Bradshaw <
 rober...@google.com> wrote:

> Well deserved, Chad. Congratulations!
>
> On Mon, Feb 24, 2020 at 2:43 PM Reza Rokni  wrote:
> >
> > Congratulations! :-)
> >
> > On Tue, Feb 25, 2020 at 6:41 AM Chad Dombrova 
> wrote:
> >>
> >> Thanks, folks!  I'm very excited to "retest this" :)
> >>
> >> Especially big thanks to Robert and Udi for all their hard work
> reviewing my PRs.
> >>
> >> -chad
> >>
> >>
> >> On Mon, Feb 24, 2020 at 1:44 PM Brian Hulette <
> bhule...@google.com> wrote:
> >>>
> >>> Congratulations Chad! Thanks for all your contributions :)
> >>>
> >>> On Mon, Feb 24, 2020 at 1:43 PM Kyle Weaver <
> kcwea...@google.com> wrote:
> 
>  Well-deserved, thanks for your dedication to the project
> Chad. :)
> 
>  On Mon, Feb 24, 2020 at 1:34 PM Udi Meiri 
> wrote:
> >
> > Congrats and welcome, Chad!
> >
> > On Mon, Feb 24, 2020 at 1:21 PM Pablo Estrada <
> pabl...@google.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> Please join me and the rest of the Beam PMC in welcoming a
> new committer: Chad Dombrova
> >>
> >> Chad has contributed to the project in multiple ways,
> including improvements to the testing infrastructure, and adding type
> annotations throughout the Python SDK, as well as working closely 
> with the
> community on these improvements.
> >>
> >> In consideration of his contributions, the Beam PMC trusts
> him with the responsibilities of a Beam Committer[1].
> >>
> >> Thanks Chad for your contributions!
> >>
> >> -Pablo, on behalf of the Apache Beam PMC.
> >>
> >> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>



Re: [ANNOUNCE] New committer: Jincheng Sun

2020-02-24 Thread Gleb Kanterov
Congratulations!

On Mon, Feb 24, 2020 at 1:18 PM Hequn Cheng  wrote:

> Congratulations Jincheng, well deserved!
>
> Best,
> Hequn
>
> On Mon, Feb 24, 2020 at 7:21 PM Reza Rokni  wrote:
>
>> Congrats!
>>
>> On Mon, Feb 24, 2020 at 7:15 PM Jan Lukavský  wrote:
>>
>>> Congrats Jincheng!
>>>
>>>   Jan
>>>
>>> On 2/24/20 11:55 AM, Maximilian Michels wrote:
>>> > Hi everyone,
>>> >
>>> > Please join me and the rest of the Beam PMC in welcoming a new
>>> > committer: Jincheng Sun 
>>> >
>>> > Jincheng has worked on generalizing parts of Beam for Flink's Python
>>> > API. He has also picked up other issues, like fixing documentation,
>>> > implementing missing features, or cleaning up code [1].
>>> >
>>> > In consideration of his contributions, the Beam PMC trusts him with
>>> > the responsibilities of a Beam committer [2].
>>> >
>>> > Thank you for your contributions Jincheng!
>>> >
>>> > -Max, on behalf of the Apache Beam PMC
>>> >
>>> > [1]
>>> >
>>> https://jira.apache.org/jira/browse/BEAM-9299?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(sunjincheng121)
>>> > [2]
>>> >
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-18 Thread Gleb Kanterov
Congratulations!

On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette  wrote:

> Congratulations Alex! Well deserved!
>
> On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada  wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Alex Van Boxel
>>
>> Alex has contributed to Beam in many ways - as an organizer for Beam
>> Summit and meetups - and also with the Protobuf extensions for schemas.
>>
>> In consideration of his contributions, the Beam PMC trusts him with the
>> responsibilities of a Beam committer[1].
>>
>> Thanks for your contributions Alex!
>>
>> Pablo, on behalf of the Apache Beam PMC.
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: Custom 2.20 failing on Dataflow: what am I doing wrong?

2020-02-17 Thread Gleb Kanterov
You need to pass a custom Dataflow worker jar. One of the ways of doing that
is adding it as a dependency and using the following code snippet:
opts.setDataflowWorkerJar(
    BatchDataflowWorker.class
        .getProtectionDomain()
        .getCodeSource()
        .getLocation()
        .toString());
opts.setWorkerHarnessContainerImage("");

Coming with the disclaimer that it isn't for production :)
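
Equivalently (a sketch; the path is a placeholder), the jar can be passed on
the command line via the dataflowWorkerJar pipeline option:

--dataflowWorkerJar=/path/to/beam-dataflow-worker.jar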

On Mon, Feb 17, 2020 at 8:34 AM Alex Van Boxel  wrote:

> Yes, running it manually with the normal parameters as I do for production
> Dataflow. I'm probably a bit ignorant on that, and I probably need to
> provide my own worker.
>
> Thanks for the hint... I'll dive into that.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Feb 17, 2020 at 8:16 AM Reuven Lax  wrote:
>
>> Are you running things manually? This probably means you are using an
>> out-of-date Dataflow worker. I believe that all tests on Jenkins will build
>> the Dataflow worker from head to prevent exactly this problem.
>>
>> On Sun, Feb 16, 2020 at 11:10 PM Alex Van Boxel  wrote:
>>
>>> Digging further into the traces, it seems like a result of changes to the
>>> model:
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.beam.model.pipeline.v1.StandardWindowFns$SessionsPayload$Enum
>>>
>>> I see changes by Lukasz Cwik. Will this be a problem for the release?
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Sun, Feb 16, 2020 at 12:11 PM Alex Van Boxel 
>>> wrote:
>>>
 Hey,

 I'm testing my own PRs against Dataflow; something I've done in the
 past with success seems to fail now. I get this error:

 java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.construction.WindowingStrategyTranslation

 Am I doing something wrong?

  _/
 _/ Alex Van Boxel

>>>


Re: Deterministic field ordering in derived schemas

2020-02-06 Thread Gleb Kanterov
Field ordering matters, for instance, for a batch pipeline writing to a
non-partitioned BigQuery table. Each partition is a new table with its own
schema, so each day a new table would have non-deterministic field ordering.
It's arguable whether it's a good practice to define a table schema using a
Java class, even if field ordering were deterministic. Because the schema
definition language is embedded into Java, it isn't as clear as it is for,
for instance, Protobuf, whether a change keeps schema compatibility.
However, I can see how borrowing the concept of field numbers would make it
clearer.

A similar concern is relevant to streaming pipelines if there is no "update"
functionality, or for a pipeline that needs to be drained and restarted.

What are the requirements for updating streaming pipelines? Is it only that
encoding positions for existing fields shouldn't change? With that, I don't
understand how "sort all the fields" makes the "update" case worse. As I see
it, it fixes writing to external storage; it doesn't solve the problem of
"update", but it doesn't make it worse either.

Gleb

On Thu, Feb 6, 2020 at 6:01 AM Reuven Lax  wrote:

> Let's understand the use case first.
>
> My concern was with making SchemaCoder compatible between different
> invocations of a pipeline, and that's why I introduced encoding_position.
> This allows the field id to change, but we can preserve the same
> encoding_position. However this is internal to a pipeline.
>
> If the worry is writing rows to a sink, how are the rows being written? I
> would highly advise against using Beam's internal binary representation to
> write rows external to a pipeline. That representation is meant to be an
> internal detail of schemas, not a public binary format. Rows should be
> converted to some public format before being written.
>
> I wonder if a convenience method on Row - getValuesOrderedByName() - would
> be sufficient for this use case?
>
> Reuven
>
> On Wed, Feb 5, 2020 at 8:49 PM Kenneth Knowles  wrote:
>
>> Are we in danger of reinventing protobuf's practice of giving fields
>> numbers? (This practice itself was almost certainly used decades before
>> protobuf's creation.) Could we just use the same practice?
>>
>> Schema fields already have integer IDs and "encoding_position" (see
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto).
>> Are these the same as proto field numbers? Do we need both? What is the
>> expectation around how they interact? The proto needs
>> comments/documentation!
>>
>> This does not directly address the question, but any solution related to
>> how auto-generated schemas work should be specified in terms of the proto.
>> For example, annotations to suggest one or both of these fields. Or,
>> lacking that, sorting by name (giving up on "new fields come last"
>> behavior). Or warning that the schema is unstable. Etc.
>>
>> Kenn
>>
>> On Wed, Feb 5, 2020 at 10:47 AM Luke Cwik  wrote:
>>
>>> The Java compiler doesn't know whether a field was added or
>>> removed when compiling source to class, so there is no way for it to
>>> provide an ordering that puts "new" fields at the end, and the source
>>> specification doesn't allow users to state the field ordering that should
>>> be used.
>>> You can ask users to annotate a field ordering[1] using custom annotations
>>> but a general solution will require some type of sorting.
>>>
>>> 1: https://stackoverflow.com/a/1099389/4368200
>>>
>>> On Wed, Feb 5, 2020 at 10:31 AM Reuven Lax  wrote:
>>>
>>>> I have yet to figure out a way to make Schema inference
>>>> deterministically ordered, because Java reflection provides no guaranteed
>>>> ordering (I suspect that the JVM returns functions by iterating over a hash
>>>> map, or something of that form). Ideas such as "sort all the fields"
>>>> actually make things worse, because new fields will end up in the middle
>>>> of the field list.
>>>>
>>>> This is a problem for runners that support an "update" functionality.
>>>> Currently the solution I was working on was to allow the runner to inspect
>>>> the previous graph on an update, to ensure that we maintain the previous
>>>> order.
>>>>
>>>> If you know a way to ensure deterministic ordering, I would love to
>>>> know. I even went so far as to try and open the .class file to get members
>>>> in the order defined there, but that is very complex, error prone, and I
>>>> believe still doesn't guarantee order stability.

Deterministic field ordering in derived schemas

2020-02-05 Thread Gleb Kanterov
There are Beam schema providers that use Java reflection to get fields for
field-based classes and AutoValue classes. It isn't relevant for POJOs with
"creators", because function arguments are ordered. We cache instances of
schema coders, but there is no guarantee that the ordering is deterministic
between JVMs. As a result, I've seen cases where the construction of pipeline
graphs and output schemas is non-deterministic. It's especially relevant when
writing data to external storage, where a row schema becomes a table schema.
There is a workaround: apply a transform that makes the schema deterministic,
for instance, by ordering fields by name.
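
As an illustration of that workaround, a minimal sketch (a hypothetical
helper, not an existing Beam API; it only sorts top-level fields and ignores
nested rows):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

class DeterministicSchemas {
  // Rebuild the schema with fields ordered by name.
  static Schema sortedByName(Schema schema) {
    List<Schema.Field> fields = new ArrayList<>(schema.getFields());
    fields.sort(Comparator.comparing(Schema.Field::getName));
    return fields.stream().collect(Schema.toSchema());
  }

  // Re-emit a row with its values in the sorted field order.
  static Row reorder(Row row, Schema sorted) {
    Row.Builder builder = Row.withSchema(sorted);
    for (Schema.Field field : sorted.getFields()) {
      builder.addValue(row.getValue(field.getName()));
    }
    return builder.build();
  }
}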

I would see a benefit in making schemas deterministic by default or at
least introducing a way to do so without writing custom code. What are your
thoughts?


Re: [ANNOUNCE] New committer: Michał Walenia

2020-01-28 Thread Gleb Kanterov
Congratulations!

On Tue, Jan 28, 2020 at 6:03 PM Łukasz Gajowy  wrote:

> Congratulations Michał! 
>
> On Tue, Jan 28, 2020 at 4:33 PM Ryan Skraba  wrote:
>
>> Congratulations!
>>
>> On Tue, Jan 28, 2020 at 11:26 AM Jan Lukavský  wrote:
>>
>>> Congrats Michał!
>>> On 1/28/20 11:16 AM, Katarzyna Kucharczyk wrote:
>>>
>>> Congratulations Michał!  
>>>
>>> On Tue, Jan 28, 2020 at 9:29 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Congrats, Michał!

 On 28 Jan 2020, at 09:20, Ismaël Mejía  wrote:

 Congratulations Michał, well deserved!

 On Tue, Jan 28, 2020 at 8:54 AM Kamil Wasilewski <
 kamil.wasilew...@polidea.com> wrote:

> Congrats, Michał!
>
> On Tue, Jan 28, 2020 at 3:03 AM Udi Meiri  wrote:
>
>> Congratulations Michał!
>>
>> On Mon, Jan 27, 2020 at 3:49 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> Congrats Michał!
>>>
>>> On Mon, Jan 27, 2020 at 2:59 PM Reza Rokni  wrote:
>>>
 Congratulations buddy!

 On Tue, 28 Jan 2020, 06:52 Valentyn Tymofieiev, <
 valen...@google.com> wrote:

> Congratulations, Michał!
>
> On Mon, Jan 27, 2020 at 2:24 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Nice -- keep up the good work!
>>
>> On Mon, Jan 27, 2020 at 2:02 PM Mikhail Gryzykhin <
>> mig...@google.com> wrote:
>> >
>> > Congratulations Michal!
>> >
>> > --Mikhail
>> >
>> > On Mon, Jan 27, 2020 at 1:01 PM Kyle Weaver <
>> kcwea...@google.com> wrote:
>> >>
>> >> Congratulations Michał! Looking forward to your future
>> contributions :)
>> >>
>> >> Thanks,
>> >> Kyle
>> >>
>> >> On Mon, Jan 27, 2020 at 12:47 PM Pablo Estrada <
>> pabl...@google.com> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> Please join me and the rest of the Beam PMC in welcoming a
>> new committer: Michał Walenia
>> >>>
>> >>> Michał has contributed to Beam in many ways, including the
>> performance testing infrastructure, and has even spoken at events 
>> about
>> Beam.
>> >>>
>> >>> In consideration of his contributions, the Beam PMC trusts
>> him with the responsibilities of a Beam committer[1].
>> >>>
>> >>> Thanks for your contributions Michał!
>> >>>
>> >>> Pablo, on behalf of the Apache Beam PMC.
>> >>>
>> >>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>



Re: Beam's Avro 1.8.x dependency

2020-01-16 Thread Gleb Kanterov
Adding to Ismaël: I see moving Avro out of the core and keeping
compatibility as non-exclusive options. Of course, it would require more
effort on our side.

On Thu, Jan 16, 2020 at 12:29 PM Ismaël Mejía  wrote:

> For the interested there was also some extra context in the discussion at:
> https://github.com/apache/beam/pull/9779
>
> Gleb mentioned the key two points:
>
> 1. The fact that Avro is exposed in the User API in beam-sdks-java-core
> was a
>mistake and makes fixing this issue backwards incompatible.
>
> 2. Shading is not an option because Avro compiler would generate specific
>records that won't match the non-vendored version so we will break user
>records compatibility (for example for users with a schema registry).
>
> So, unless I am missing something and someone can give an alternative, we
> are in a situation where the only solution to the issue is to do (1): move
> Avro out of core as an extension. But then the question is whether we would
> sacrifice backwards compatibility for this issue. I am in the 'we should do
> it' camp. What do others think?
>
>
> On Thu, Jan 16, 2020 at 10:17 AM Gleb Kanterov  wrote:
>
>> There are significant changes between Avro 1.8 and Avro 1.9. I'm not sure
>> it's possible for beam-sdks-java-core to support both at the same time. The
>> fact that AvroIO is a part of the beam-sdks-java-core doesn't make it
>> simpler. However, I can see how we can build two binary artifacts with the
>> same user-facing API, each supporting its own version of Avro.
>>
>> Shading or vendoring would be a breaking change because public signatures
>> of AvroIO (and a few other IOs, for instance, BigQueryIO) refer to classes
>> from Avro, for instance, GenericRecord. Furthermore, a lot of Beam users
>> use Avro compiler to generate Java code for SpecificRecord, which would
>> refer to non-vendored version.
>>
>


Re: Beam's Avro 1.8.x dependency

2020-01-16 Thread Gleb Kanterov
There are significant changes between Avro 1.8 and Avro 1.9. I'm not sure
it's possible for beam-sdks-java-core to support both at the same time. The
fact that AvroIO is a part of the beam-sdks-java-core doesn't make it
simpler. However, I can see how we can build two binary artifacts with the
same user-facing API, each supporting its own version of Avro.

Shading or vendoring would be a breaking change because public signatures
of AvroIO (and a few other IOs, for instance, BigQueryIO) refer to classes
from Avro, for instance, GenericRecord. Furthermore, a lot of Beam users
use Avro compiler to generate Java code for SpecificRecord, which would
refer to non-vendored version.


Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread Gleb Kanterov
+1 (non-binding)

On Fri, Dec 13, 2019 at 12:47 PM jincheng sun 
wrote:

> +1 (non-binding)
>
> On Fri, Dec 13, 2019 at 4:21 PM Alex Van Boxel wrote:
>
>> +1
>>
>> On Fri, Dec 13, 2019, 05:58 Kenneth Knowles  wrote:
>>
>>> Please vote on the proposal for Beam's mascot to be the Firefly. This
>>> encompasses the Lampyridae family of insects, without specifying a genus or
>>> species.
>>>
>>> [ ] +1, Approve Firefly being the mascot
>>> [ ] -1, Disapprove Firefly being the mascot
>>>
>>> The vote will be open for at least 72 hours excluding weekends. It is
>>> adopted by at least 3 PMC +1 approval votes, with no PMC -1 disapproval
>>> votes*. Non-PMC votes are still encouraged.
>>>
>>> PMC voters, please help by indicating your vote as "(binding)"
>>>
>>> Kenn
>>>
>>> *I have chosen this format for this vote, even though Beam uses simple
>>> majority as a rule, because I want any PMC member to be able to veto based
>>> on concerns about overlap or trademark.
>>>
>> --
>
> Best,
> Jincheng
> -
> Committer & PMC Member at @ApacheFlink
> Staff Engineer at @Alibaba
> Blog: https://enjoyment.cool
> Twitter: https://twitter.com/sunjincheng121
> --
>


Re: Detecting resources to stage

2019-11-27 Thread Gleb Kanterov
Agree, this makes sense.

On Wed, Nov 27, 2019 at 6:23 PM Luke Cwik  wrote:

> That looks good as well.
>
> I would suggest that we make the classpath scanning system pluggable using
> PipelineOptions. For example in GcpOptions[1], we use two default instance
> factories. The first one controls which class is used as the factory[2] and
> the second one instantiates an instance of that class and creates the
> credential[3]. The same strategy could be added where there is a default
> instance factory for the set of resources and another option which controls
> which class is instantiated to provide that default.
>
> Do you think that we could make the default always:
> new ClassGraph()
>   .addClassLoader(classLoader)
>   .getClasspathURLs();
>
> 1:
> https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L159
> 2:
> https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L144
> 3:
> https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L159
>
> On Wed, Nov 27, 2019 at 8:19 AM Gleb Kanterov  wrote:
>
>> I didn't think it through, but this is something I have in mind: keep the
>> existing implementation for URLClassLoader, and use ClassGraph for
>> experimental support of Java 11.
>>
>> List<URL> urls;
>> if (classLoader instanceof URLClassLoader) {
>>   urls = Arrays.asList(((URLClassLoader) classLoader).getURLs());
>> } else {
>>   urls = new ClassGraph()
>>       .addClassLoader(classLoader)
>>       .getClasspathURLs();
>> }
>>
>> On Wed, Nov 27, 2019 at 4:16 PM Łukasz Gajowy 
>> wrote:
>>
>>> This looks promising. Do you think you could share your code as well?
>>>
>>> That part sounds very calming:
>>> "ClassGraph is fully compatible with the new JPMS module system (Project
>>> Jigsaw / JDK 9+), i.e. it can scan both the traditional classpath and the
>>> module path. However, the code is also fully backwards compatible with JDK
>>> 7 and JDK 8 (i.e. the code is compiled in Java 7 compatibility mode, and
>>> all interaction with the module system is implemented via reflection for
>>> backwards compatibility)."
>>>
>>> I'm currently working on rebuilding the classpath detection mechanism so
>>> that it scans java.class.path when URLClassLoader cannot be used (as Luke
>>> suggested) but if we decide to use classgraph it should be relatively easy
>>> to do that instead. Moreover, I want to enable the possibility of injecting
>>> any algorithm implementation through pipeline options - this will enable
>>> third-party vendors to inject their custom implementations if needed (SPI
>>> pattern that was mentioned at some point in a jira ticket). I think I'm
>>> pretty close to finishing that.
>>>
>>> Thanks!
>>>
>>> On Wed, Nov 27, 2019 at 3:24 PM Gleb Kanterov  wrote:
>>>
>>>> Today I tried using the classgraph [1] library to scan the classpath in
>>>> Java 11 instead of using URLClassLoader, and after that, the job worked on
>>>> Dataflow. The logic of scanning the classpath is pretty sophisticated [2],
>>>> and classgraph doesn't have any dependencies. I'm wondering if we can
>>>> relocate it to the java-core jar and use it for non-URLClassLoaders?
>>>>
>>>> [1]: https://github.com/classgraph/classgraph
>>>> [2]:
>>>> https://github.com/classgraph/classgraph/blob/master/src/main/java/io/github/classgraph/Scanner.java
>>>>
>>>> On Fri, Nov 8, 2019 at 11:40 PM Luke Cwik  wrote:
>>>>
>>>>> I believe the closest suggestion[1] we had that worked for Java 11 and
>>>>> maintained backwards compatibility was to use the URLClassLoader to infer
>>>>> the resources and if we couldn't do that then look at the java.class.path
>>>>> system property to do the inference otherwise fail and force the users to
>>>>> tell us what. There are too many scenarios where we will do it wrong
>>>>> because of how people package and deploy their code whether it is an
>>>>> embedded application server or some other application container with a
>>>>> security manager that will prevent us 

Re: Detecting resources to stage

2019-11-27 Thread Gleb Kanterov
I didn't think it through, but this is something I have in mind: keep the
existing implementation for URLClassLoader, and use ClassGraph for
experimental support of Java 11.

List<URL> urls;
if (classLoader instanceof URLClassLoader) {
  urls = Arrays.asList(((URLClassLoader) classLoader).getURLs());
} else {
  urls = new ClassGraph()
      .addClassLoader(classLoader)
      .getClasspathURLs();
}

On Wed, Nov 27, 2019 at 4:16 PM Łukasz Gajowy 
wrote:

> This looks promising. Do you think you could share your code as well?
>
> That part sounds very calming:
> "ClassGraph is fully compatible with the new JPMS module system (Project
> Jigsaw / JDK 9+), i.e. it can scan both the traditional classpath and the
> module path. However, the code is also fully backwards compatible with JDK
> 7 and JDK 8 (i.e. the code is compiled in Java 7 compatibility mode, and
> all interaction with the module system is implemented via reflection for
> backwards compatibility)."
>
> I'm currently working on rebuilding the classpath detection mechanism so
> that it scans java.class.path when URLClassLoader cannot be used (as Luke
> suggested) but if we decide to use classgraph it should be relatively easy
> to do that instead. Moreover, I want to enable the possibility of injecting
> any algorithm implementation through pipeline options - this will enable
> third-party vendors to inject their custom implementations if needed (SPI
> pattern that was mentioned at some point in a jira ticket). I think I'm
> pretty close to finishing that.
>
> Thanks!
>
> On Wed, Nov 27, 2019 at 15:24 Gleb Kanterov  wrote:
>
>> Today I tried using classgraph [1] library to scan classpath in Java 11
>> instead of using URLClassLoader, and after that, the job worked on
>> Dataflow. The logic of scanning classpath is pretty sophisticated [2], and
>> classgraph doesn't have any dependencies. I'm wondering if we can relocate
>> it to java-core jar and use it for non-URLClassLoaders?
>>
>> [1]: https://github.com/classgraph/classgraph
>> [2]:
>> https://github.com/classgraph/classgraph/blob/master/src/main/java/io/github/classgraph/Scanner.java
>>
>> On Fri, Nov 8, 2019 at 11:40 PM Luke Cwik  wrote:
>>
>>> I believe the closest suggestion[1] we had that worked for Java 11 and
>>> maintained backwards compatibility was to use the URLClassLoader to infer
>>> the resources and if we couldn't do that then look at the java.class.path
>>> system property to do the inference otherwise fail and force the users to
>>> tell us what. There are too many scenarios where we will do it wrong
>>> because of how people package and deploy their code whether it is an
>>> embedded application server or some other application container with a
>>> security manager that will prevent us from doing the right thing.
>>>
>>> On Fri, Nov 8, 2019 at 10:31 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> Note that resources are more properly tied to specific operations and
>>>> stages, not to the entire pipeline. This is especially true in the
>>>> face of libraries (which should have the ability to declare their own
>>>> resources) and cross-language.
>>>>
>>>> On Fri, Nov 8, 2019 at 10:19 AM Łukasz Gajowy 
>>>> wrote:
>>>> >
>>>> > I figured that it would be good to bump this thread for greater
>>>> visibility even though I don't have a strong opinion about this (yet -
>>>> hopefully, I will know more later to be able to share ;) ).
>>>> >
>>>> > Answering the questions Luke asked will unblock this issue:
>>>> https://issues.apache.org/jira/browse/BEAM-5495. Solving it is needed
>>>> for Java 11 migration (current detecting mechanism does not work with java
>>>> > 8).
>>>> >
>>>> >
>>>> >>
>>>> >> That said letting the user resolve the jars to stage can be saner
>>>> instead of assuming it is in the classpath/loader. I already have a few
>>>> cases where it will fail cause the transforms load the jars from outside
>>>> the app classloader (transforms are isolated).
>>>> >
>>>> >
>>>> >
>>>> > If I understand correctly, at least in Dataflow runner, if users want
>>>> to provide custom resources to stage, they can use filesToStage pipeline
>>>> option. Once the option is not null, the runner doesn't detect the
>>>> resources automatically and stages resources listed in the option
>>>> instead. I think this should be the approach common for all runners (if it
>>>> is not the case already).
>>>>
>>>
>>> Your understanding is correct and consistency across runners for a
>>> pipeline option is good for our users.
>>>
>>>
>>>> >
>>>> > Thanks,
>>>> > Łukasz
>>>> >
>>>> >
>>>> >
>>>>
>>>
>>> 1: https://github.com/apache/beam/pull/8775
>>>
>>


Re: Detecting resources to stage

2019-11-27 Thread Gleb Kanterov
Today I tried using classgraph [1] library to scan classpath in Java 11
instead of using URLClassLoader, and after that, the job worked on
Dataflow. The logic of scanning classpath is pretty sophisticated [2], and
classgraph doesn't have any dependencies. I'm wondering if we can relocate
it to java-core jar and use it for non-URLClassLoaders?

[1]: https://github.com/classgraph/classgraph
[2]:
https://github.com/classgraph/classgraph/blob/master/src/main/java/io/github/classgraph/Scanner.java

On Fri, Nov 8, 2019 at 11:40 PM Luke Cwik  wrote:

> I believe the closest suggestion[1] we had that worked for Java 11 and
> maintained backwards compatibility was to use the URLClassLoader to infer
> the resources and if we couldn't do that then look at the java.class.path
> system property to do the inference otherwise fail and force the users to
> tell us what. There are too many scenarios where we will do it wrong
> because of how people package and deploy their code whether it is an
> embedded application server or some other application container with a
> security manager that will prevent us from doing the right thing.
>
> On Fri, Nov 8, 2019 at 10:31 AM Robert Bradshaw 
> wrote:
>
>> Note that resources are more properly tied to specific operations and
>> stages, not to the entire pipeline. This is especially true in the
>> face of libraries (which should have the ability to declare their own
>> resources) and cross-language.
>>
>> On Fri, Nov 8, 2019 at 10:19 AM Łukasz Gajowy  wrote:
>> >
>> > I figured that it would be good to bump this thread for greater
>> visibility even though I don't have a strong opinion about this (yet -
>> hopefully, I will know more later to be able to share ;) ).
>> >
>> > Answering the questions Luke asked will unblock this issue:
>> https://issues.apache.org/jira/browse/BEAM-5495. Solving it is needed
>> for Java 11 migration (current detecting mechanism does not work with java
>> > 8).
>> >
>> >
>> >>
>> >> That said letting the user resolve the jars to stage can be saner
>> instead of assuming it is in the classpath/loader. I already have a few
>> cases where it will fail cause the transforms load the jars from outside
>> the app classloader (transforms are isolated).
>> >
>> >
>> >
>> > If I understand correctly, at least in Dataflow runner, if users want
>> to provide custom resources to stage, they can use filesToStage pipeline
>> option. Once the option is not null, the runner doesn't detect the
>> resources automatically and stages resources listed in the option
>> instead. I think this should be the approach common for all runners (if it
>> is not the case already).
>>
>
> Your understanding is correct and consistency across runners for a
> pipeline option is good for our users.
>
>
>> >
>> > Thanks,
>> > Łukasz
>> >
>> >
>> >
>>
>
> 1: https://github.com/apache/beam/pull/8775
>


Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-22 Thread Gleb Kanterov
[ ] Beaver
[ ] Hedgehog
[ ] Lemur
[X] Owl
[ ] Salmon
[ ] Trout
[ ] Robot dinosaur
[ ] Firefly
[ ] Cuttlefish
[ ] Dumbo Octopus
[ ] Angler fish


On Fri, Nov 22, 2019 at 11:33 PM Andrew Pilloud  wrote:

>
> [ ] Beaver
> [ ] Hedgehog
> [ ] Lemur
> [ ] Owl
> [X] Salmon
> [X] Trout
> [ ] Robot dinosaur
> [X] Firefly
> [ ] Cuttlefish
> [ ] Dumbo Octopus
> [ ] Angler fish
>
>


Re: goVet and clickHouse tests failing

2019-11-21 Thread Gleb Kanterov
:sdks:java:io:clickhouse:test is using testcontainers. Testcontainers is a
Java library that supports JUnit tests, providing lightweight, throwaway
instances of common databases, Selenium web browsers, or anything else that
can run in a Docker container. Therefore, ClickHouse tests require a local
Docker environment. It might make sense to move tests using testcontainers
into a separate gradle task if it's a common issue.
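
For anyone hitting this, a minimal JUnit sketch of the pattern these tests
rely on (ClickHouseContainer comes from the testcontainers library, not
from Beam; the rule fails during class initialization when no Docker
daemon is reachable, which shows up as the IllegalStateException in
reports like the one quoted below):

import org.junit.ClassRule;
import org.junit.Test;
import org.testcontainers.containers.ClickHouseContainer;

public class ClickHouseSmokeTest {
  // Starts a throwaway ClickHouse instance in Docker, shared by the class.
  @ClassRule
  public static final ClickHouseContainer clickHouse = new ClickHouseContainer();

  @Test
  public void containerIsReachable() {
    // Docker maps a random host port, e.g. jdbc:clickhouse://localhost:32768/
    String jdbcUrl = clickHouse.getJdbcUrl();
    // ... point the pipeline under test at jdbcUrl ...
  }
}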

On Thu, Nov 21, 2019 at 12:10 PM Elliotte Rusty Harold 
wrote:

> I'm slowly working my way through getting the tests to run and pass.
> We have a lot of work to do on the contributing docs to explain how to
> setup and run the build. There's clearly a lot of knowledge in
> developers' heads and workstations that hasn't yet made it into the
> docs.
>
> The latest is a problem finding "github.com/linkedin/goavro" when I
> run goVet. I'm not a go person. Is this something that requires an
> extra install? If so, how is it installed? Or is this some error in
> the build.gradles? Or perhaps my go config is borked and gradle is
> looking in the wrong directory?
>
> > Task :sdks:go:examples:resolveBuildDependencies
> Resolving ./github.com/apache/beam/sdks/go@/home/elharo/beam/sdks/go
> .gogradle/project_gopath/src/
> github.com/apache/beam/sdks/go/examples/vendor/github.com/apache/beam/sdks/go/pkg/beam/io/avroio/avroio.go:28:2
> :
> cannot find package "github.com/linkedin/goavro" in any of:
> /home/elharo/beam/sdks/go/examples/.gogradle/project_gopath/src/
> github.com/apache/beam/sdks/go/examples/vendor/github.com/linkedin/goavro
> (vendor tree)
> /home/elharo/.gradle/go/binary/1.12/go/src/
> github.com/linkedin/goavro
> (from $GOROOT)
> /home/elharo/beam/sdks/go/examples/.gogradle/project_gopath/src/
> github.com/linkedin/goavro
> (from $GOPATH)
>
> > Task :sdks:go:examples:goVet FAILED
>
> I'm also seeing failures in ClickHouseIOTest:
>
> > Task :sdks:java:io:clickhouse:test
>
> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
> java.lang.IllegalStateException
>
> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
> java.lang.NullPointerException
>
> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
> java.lang.IllegalStateException
>
> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
> java.lang.NullPointerException
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>


Re: [ANNOUNCE] New committer: Daniel Oliveira

2019-11-21 Thread Gleb Kanterov
Congratulations!

On Thu, Nov 21, 2019 at 6:24 AM Thomas Weise  wrote:

> Congratulations!
>
>
> On Wed, Nov 20, 2019, 7:56 PM Chamikara Jayalath 
> wrote:
>
>> Congrats!!
>>
>> On Wed, Nov 20, 2019 at 5:21 PM Daniel Oliveira 
>> wrote:
>>
>>> Thank you everyone! I won't let you down. o7
>>>
>>> On Wed, Nov 20, 2019 at 2:12 PM Ruoyun Huang  wrote:
>>>
 Congrats Daniel!

 On Wed, Nov 20, 2019 at 1:58 PM Robert Burke 
 wrote:

> Congrats Daniel! Much deserved.
>
> On Wed, Nov 20, 2019, 12:49 PM Udi Meiri  wrote:
>
>> Congrats Daniel!
>>
>> On Wed, Nov 20, 2019 at 12:42 PM Kyle Weaver 
>> wrote:
>>
>>> Congrats Dan! Keep up the good work :)
>>>
>>> On Wed, Nov 20, 2019 at 12:41 PM Cyrus Maden 
>>> wrote:
>>>
 Congratulations! This is great news.

 On Wed, Nov 20, 2019 at 3:24 PM Rui Wang  wrote:

> Congrats!
>
>
> -Rui
>
> On Wed, Nov 20, 2019 at 11:48 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congrats, Daniel!
>>
>> On Wed, Nov 20, 2019 at 11:47 AM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Daniel Oliveira
>>>
>>> Daniel introduced himself to dev@ over two years ago and has
>>> contributed in many ways since then. Daniel has contributed to 
>>> general
>>> project health, the portability framework, and all three languages: 
>>> Java,
>>> Python SDK, and Go. I would like to particularly highlight how he 
>>> deleted
>>> 12k lines of dead reference runner code [1].
>>>
>>> In consideration of Daniel's contributions, the Beam PMC trusts
>>> him with the responsibilities of a Beam committer [2].
>>>
>>> Thank you, Daniel, for your contributions and looking forward to
>>> many more!
>>>
>>> Kenn, on behalf of the Apache Beam PMC
>>>
>>> [1] https://github.com/apache/beam/pull/8380
>>> [2]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>

 --
 
 Ruoyun  Huang




Re: [PROPOSAL] Add support for writing flattened schemas to pubsub

2019-11-17 Thread Gleb Kanterov
Expanding on what Kenn said regarding having fewer dependencies on SQL. Can
the whole thing be seen as extending PubSubIO, which would implement most of
the logic from the proposal, given column annotations, and then having a
thin layer that connects it with Beam SQL tables?
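
A rough sketch of what that layering could look like at the IO level (all
of these builder methods are hypothetical, just to illustrate where the
logic would live):

// PubsubIO would own the flattening and attribute mapping, driven by the
// column annotations; the SQL table provider would only translate
// CREATE TABLE options into these calls.
PCollection<Row> rows =
    pipeline.apply(
        PubsubIO.readRows()
            .fromTopic("projects/my-project/topics/my-topic")
            .withEventTimestampField("my_timestamp")
            .withAttributeField("my_id", "id_name"));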

On Sun, Nov 17, 2019 at 12:38 PM Alex Van Boxel  wrote:

> I like it, but I'm worried about the magic event_timestamp injection.
> Wouldn't explicit injection via option not be a better approach:
>
> CREATE TABLE people (
> my_timestamp TIMESTAMP *OPTION(ref="pubsub:event_timestamp")*,
> my_id VARCHAR *OPTION(ref="pubsub:attributes['id_name']")*,
> name VARCHAR,
> age INTEGER
>   )
>   TYPE 'pubsub'
>   LOCATION 'projects/my-project/topics/my-topic'
>
>
>  _/
> _/ Alex Van Boxel
>
>
> On Sat, Nov 16, 2019 at 7:58 PM Kenneth Knowles  wrote:
>
>> Big +1 from me.
>>
>> Nice explanation. This makes a lot of sense. Much simpler to understand
>> with fewer magic strings. It also makes the Beam SQL connector less
>> dependent on newer SQL features that are simply less widespread. I'm not
>> too surprised that Calcite's nested row support lags behind the rest of the
>> library. It simply isn't as widespread and important as flat relational
>> structures. And MAP is even less widespread.
>>
>> Kenn
>>
>> On Wed, Nov 13, 2019 at 12:32 PM Brian Hulette 
>> wrote:
>>
>>> I've been looking into adding support for writing (i.e. INSERT INTO
>>> statements) for the pubsub DDL, which currently only supports reading. This
>>> DDL requires the defined schema to have exactly three fields:
>>> event_timestamp, attributes, and payload, corresponding to the fields in
>>> PubsubMessage (event_timestamp can be configured to come from either
>>> publish time or from the value in a particular attribute, and the payload
>>> must be a ROW with a schema corresponding to the JSON written to the pubsub
>>> topic).
>>>
>>> When writing, I think it's a bit onerous to require users to use exactly
>>> these three top-level fields. For example imagine we have two topics:
>>> people, and eligible_voters. people contains a stream of {"name": "..",
>>> age: XX} items, and we want eligible_voters to contain a stream with
>>> {"name": ".."} items corresponding to people with age >= 18. With the
>>> current approach this would look like:
>>>
>>> ```
>>> CREATE TABLE people (
>>> event_timestamp TIMESTAMP,
>>> attributes MAP<VARCHAR, VARCHAR>,
>>> payload ROW<name VARCHAR, age INTEGER>
>>>   )
>>>   TYPE 'pubsub'
>>>   LOCATION 'projects/my-project/topics/my-topic'
>>>
>>> CREATE TABLE eligible_voters 
>>>
>>> INSERT INTO eligible_voters (
>>>   SELECT
>>> ROW(payload.name AS name) AS payload
>>> FROM people
>>> WHERE payload.age >= 18
>>> )
>>> ```
>>>
>>> This query has lots of renaming and boiler-plate, and furthermore,
>>> ROW(..) doesn't seem well supported in Calcite, I had to jump through some
>>> hoops (like calling my fields $col1), to make something like this work.
>>> I think it would be great if we could instead handle flattened,
>>> payload-only schemas. We would still need to have a separate
>>> event_timestamp field, but everything else would map to a field in the
>>> payload. With this change the previous example would look like:
>>>
>>> ```
>>> CREATE TABLE people (
>>> event_timestamp TIMESTAMP,
>>> name VARCHAR,
>>> age INTEGER
>>>   )
>>>   TYPE 'pubsub'
>>>   LOCATION 'projects/my-project/topics/my-topic'
>>>
>>> CREATE TABLE eligible_voters ...
>>>
>>> INSERT INTO eligible_voters (
>>>   SELECT
>>> name
>>> FROM people
>>> WHERE age >= 18
>>> )
>>> ```
>>>
>>> This is much cleaner! But the overall approach has an obvious downside -
>>> with the table definition written like this it's impossible to read from or
>>> write to the message attributes (unless one is being used for
>>> event_timestamp). I think we can mitigate this in two ways:
>>> 1. In the future, this flattened schema definition could be represented
>>> as something like a view on the expanded definition. We could allow users
>>> to provide some metadata indicating that a column should correspond to a
>>> particular attribute, rather than a field in the payload. To me this feels
>>> similar to how you indicate a column should be indexed in a database. It's
>>> data that's relevant to the storage system, and not to the actual query, so
>>> it belongs in CREATE TABLE.
>>> 2. In the meantime, we can continue to support the current syntax. If a
>>> pubsub table definition has *exactly* three fields with the expected types:
>>> event_timestamp TIMESTAMP, payload ROW<...>, and attributes MAP<VARCHAR, VARCHAR>, we can continue to use the current codepath. Otherwise we will
>>> use the flattened schema.
>>>
>>> Please let me know if anyone has any objections to this approach,
>>> otherwise I plan on moving forward with it - I should have a PR up shortly.
>>>
>>> Brian
>>>
>>


Re: [ANNOUNCE] New committer: Brian Hulette

2019-11-14 Thread Gleb Kanterov
Congratulations!

On Fri, Nov 15, 2019 at 5:44 AM Valentyn Tymofieiev 
wrote:

> Congratulations, Brian!
>
> On Thu, Nov 14, 2019 at 6:25 PM jincheng sun 
> wrote:
>
>> Congratulation Brian!
>>
>> Best,
>> Jincheng
>>
>> On Fri, Nov 15, 2019 at 7:19 AM Kyle Weaver  wrote:
>>
>>> Thanks for your contributions and congrats Brian!
>>>
>>> On Thu, Nov 14, 2019 at 3:14 PM Kenneth Knowles  wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Brian Hulette

 Brian introduced himself to dev@ earlier this year and has been
 contributing since then. His contributions to Beam include explorations of
 integration with Arrow, standardizing coders, portability for schemas, and
 presentations at Beam events.

 In consideration of Brian's contributions, the Beam PMC trusts him with
 the responsibilities of a Beam committer [1].

 Thank you, Brian, for your contributions and looking forward to many
 more!

 Kenn, on behalf of the Apache Beam PMC

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>


Re: [UPDATE] Preparing for Beam 2.17.0 release

2019-10-28 Thread Gleb Kanterov
It looks like BigQueryIO DIRECT_READ is broken since 2.16.0, I've added a
ticket describing the problem and possible fix, see BEAM-8504 [1].

[1]: https://issues.apache.org/jira/browse/BEAM-8504

On Wed, Oct 23, 2019 at 9:19 PM Kenneth Knowles  wrote:

> I opened https://github.com/apache/beam/pull/9862 to raise the
> documentation of Fix Version to the top level. It also includes the write
> up of Jira priorities, to make clear that "Blocker" priority does not refer
> to release blocking.
>
> On Wed, Oct 23, 2019 at 11:16 AM Kenneth Knowles  wrote:
>
>> I've gone over the tickets and removed Fix Version from many of them that
>> do not seem to be critical defects. If I removed Fix Version from a ticket
>> you care about, please feel free to add it back. I am not trying to decide
>> what is in/out of the release, just trying to triage the Jira data to match
>> expected practices.
>>
>> It should probably be documented somewhere outside of the release guide.
>> As far as I can tell, the fact that we triage them down to zero is the only
>> place we mention that it is used to indicate release blockers and not used
>> for feature targets.
>>
>> Kenn
>>
>> On Wed, Oct 23, 2019 at 10:40 AM Kenneth Knowles  wrote:
>>
>>>  Wow, 28 release blocking tickets! That is the most I've ever seen, by
>>> far. Many appear to be feature requests, not release-blocking defects. I
>>> believe this is not according to our normal best practice. The release
>>> cadence should not wait for features in progress, with exceptions discussed
>>> on dev@. As a matter of best practice, I think we should triage feature
>>> requests to not have Fix Version set until it has been discussed on dev@
>>> .
>>>
>>> Kenn
>>>
>>> On Wed, Oct 23, 2019 at 9:55 AM Mikhail Gryzykhin 
>>> wrote:
>>>
 Hi all,

 Beam 2.17 release branch cut is scheduled today (2019/10/23) according
 to the release calendar [1].  I'll start working on the branch cutoff
 and later work on cherry picking blocker fixes.

 If you have release blocking issues for 2.17 please mark their "Fix
 Version" as 2.17.0 [2]. This tag is already created in JIRA in case you
 would like to move any non-blocking issues to that version.

 There is a decent amount of open bugs to be resolved in 2.17.0 [2] and
 only 4 [3] are marked as blockers. Please, review those if these bugs are
 actually to be resolved in 2.17.0 and prioritize fixes if possible.

 Any thoughts, comments, objections?

 Regards.
 Mikhail.


 [1]
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
 [2]
 https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.17.0
 [3]
 https://issues.apache.org/jira/browse/BEAM-8457?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20priority%20%3D%20Blocker%20AND%20fixVersion%20%3D%202.17.0

>>>


Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-30 Thread Gleb Kanterov
Congratulations!

On Sat, Sep 28, 2019 at 12:07 AM Valentyn Tymofieiev 
wrote:

> Congratulations, Alan. Well deserved.
>
> On Fri, Sep 27, 2019 at 2:09 PM Chamikara Jayalath 
> wrote:
>
>> Congrats Alan!!
>>
>> On Fri, Sep 27, 2019 at 1:49 PM Jan Lukavský  wrote:
>>
>>> Congrats Alan!
>>> On 9/27/19 10:22 PM, Mark Liu wrote:
>>>
>>> Congratulations Alan!!!
>>>
>>> On Fri, Sep 27, 2019 at 12:55 PM Ning Kang  wrote:
>>>
 Congrats Alan!

 On Fri, Sep 27, 2019 at 12:02 PM Ankur Goenka 
 wrote:

> Congratulations Alan!
>
> On Fri, Sep 27, 2019 at 11:17 AM Yichi Zhang 
> wrote:
>
>> Congrats, Alan!
>>
>> On Fri, Sep 27, 2019 at 10:26 AM Robin Qiu 
>> wrote:
>>
>>> Congrats, Alan!
>>>
>>> On Fri, Sep 27, 2019 at 10:15 AM Hannah Jiang <
>>> hannahji...@google.com> wrote:
>>>
 Congrats Alan!

 On Fri, Sep 27, 2019 at 9:57 AM Ruoyun Huang 
 wrote:

> Congratulations, Alan!
>
>
> On Fri, Sep 27, 2019 at 9:55 AM Rui Wang 
> wrote:
>
>> Congrats!
>>
>> -Rui
>>
>> On Fri, Sep 27, 2019 at 9:54 AM Pablo Estrada 
>> wrote:
>>
>>> Yooh! : D
>>>
>>> On Fri, Sep 27, 2019 at 9:53 AM Yifan Zou 
>>> wrote:
>>>
 Congratulations, Alan!

 On Fri, Sep 27, 2019 at 9:18 AM Ahmet Altay 
 wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Alan Myrvold
>
> Alan has been a long time Beam contributor. His contributions
> made Beam more productive and friendlier [1] for all contributors 
> with
> significant improvements to Beam release process, automation, and
> infrastructure.
>
> In consideration of Alan's contributions, the Beam PMC trusts
> him
> with the responsibilities of a Beam committer [2].
>
> Thank you, Alan, for your contributions and looking forward to
> many more!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
> https://beam-summit-na-2019.firebaseapp.com/schedule/2019-09-11?sessionId=1126
> [2] https://beam.apache.org/contribute/become-a-committer
> /#an-apache-beam-committer
>

>
> --
> 
> Ruoyun  Huang
>
>

-- 
Cheers,
Gleb


Re: clickhouse tests failing

2019-09-12 Thread Gleb Kanterov
These tests are using testcontainers and assume that you have a Docker
environment available locally.

On Sun, Sep 8, 2019 at 5:14 PM Lukasz Cwik  wrote:

> Is passing at head on Jenkins:
> https://builds.apache.org/job/beam_PreCommit_Java_Cron/1771/testReport/org.apache.beam.sdk.io.clickhouse/
>
> What are the failures your seeing at initialization? (the tests do rely on
> setting up zookeeper and other stuff that could fail)
>
> On Fri, Sep 6, 2019 at 12:36 PM Elliotte Rusty Harold 
> wrote:
>
>> At head I noticed the following:
>>
>>
>> $ ./gradlew -p sdks/java/io/ check
>> Configuration on demand is an incubating feature.
>>
>> > Task :sdks:java:io:clickhouse:test
>>
>> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
>> java.lang.IllegalStateException
>>
>> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
>> java.lang.NullPointerException
>>
>> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
>> java.lang.IllegalStateException
>>
>> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
>> java.lang.NullPointerException
>>
>> 29 tests completed, 4 failed
>>
>> > Task :sdks:java:io:clickhouse:test FAILED
>>
>> FAILURE: Build failed with an exception.
>>
>>
>> Is anyone else seeing this? Are the tests expected to pass, or is
>> there some requirement (e.g. Java 11) that I might be missing?
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>

-- 
Cheers,
Gleb


Re: [DISCUSS] Portability representation of schemas

2019-09-03 Thread Gleb Kanterov
Thanks, Brian. It makes sense, it wasn't entirely clear from commit
messages, that's why I wanted to double check.

On Tue, Sep 3, 2019 at 5:43 PM Brian Hulette  wrote:

> Hey Gleb, thanks for bringing this up. The PR that was reverted (8853) is
> the same one that I referred to earlier in this thread. It modified the
> existing portable schema representation to match what we settled on here -
> and yes it removed support for logical types like fixed bytes. I
> (foolishly) assumed the portable schema representation wasn't actually used
> anywhere yet, so I figured it would be a harmless regression until we were
> able to add support again with a logical type registry (BEAM-7855 [1]). But
> it turns out the proto representation is used to build cloud objects in the
> dataflow runner, and so the change led to issues in Dataflow [2]. Thus the
> reversion.
>
> I absolutely agree now we should finish BEAM-7855 first before making
> this change again. I'm also taking a look at some of the tests that should
> have caught this issue to see how it slipped through the cracks.
>
> Brian
>
> [1] https://issues.apache.org/jira/browse/BEAM-7855
> [2] https://issues.apache.org/jira/browse/BEAM-8111
>
> On Tue, Sep 3, 2019 at 7:27 AM Gleb Kanterov  wrote:
>
>> Recently there was a pull request (that was reverted) for adding portable
>> representation of schemas. It's great to see things moving forward, but I'm
>> worried that it doesn't support any logical types, especially fixed bytes.
>> That makes runners using portable schemas unusable, for instance, when
>> reading certain Avro files, while it was possible before. I'm wondering if
>> it would be possible to include logical types into initial implementation
>> or add a feature flag to opt-in to portable schema representation.
>>
>> [1]: https://github.com/apache/beam/pull/8853
>>
>> Gleb
>>
>> On Wed, Jun 19, 2019 at 7:08 PM Kenneth Knowles  wrote:
>>
>>> Seems like a practical approach to get moving on things. Just to restate
>>> my understanding:
>>>
>>>  - in Java it is PCollection<T> but with the row coder holding
>>> to/from/clazz (I'm calling it row coder because row is the binary format,
>>> while schemas may have multiple possible formats)
>>>  - in portability, the URN + payload "is" the type, so it is
>>> PCollection<Row>
>>>  - in Python it is interpreted as PCollection<Row>
>>>
>>> We may eventually hit an issue caused by the asymmetry between
>>> portability, the authoring SDK (in this case Java), and the external SDK
>>> (in this case hypothetically Python). But perhaps the asymmetry is
>>> natural/harmless. We may eventually want it in the general form of the
>>> payload carrying identifying info of what the authoring SDK is.
>>>
>>> As to attaching conversions to operations rather than coders, I'm not so
>>> sure. It gets at the two natures of coders:
>>>
>>> 1. URN + payload fully specify how to interpret the bytes of an element
>>> 2. An SDK's implementation of the coder for a URN + payload is a
>>> contract with DoFns authored in that SDK what SDK-specific type they will
>>> receive
>>>
>>> The idea of storing the to/from conversions on operations themselves
>>> would be a sort of decoupling of 1 and 2. Assuming we design something that
>>> still allows eliding conversions, I expect that will look nearly identical.
>>> Coders are representation to/from Bytes, while this conversion layer is T
>>> to/from representation. So I might think we can do it without adding
>>> anything to the model.
>>>
>>> Kenn
>>>
>>> On Tue, Jun 18, 2019 at 11:04 PM Reuven Lax  wrote:
>>>
>>>> Robert, you are correct that in principle the to/from functions are
>>>> needed on the operation, as that's where automatic conversion happens (in
>>>> Java it happens in DoFnRunner). However there are two blockers there:
>>>>
>>>> 1. As Brian mentioned, the issue in Java is that we never have
>>>> PCollection<Row> in this case. The source PCollection will simply be
>>>> PCollection<T>, where T has a schema. The to/from functions are now
>>>> required to interpret this PCollection. Currently we need to put it on the
>>>> PCollection itself to make Java's type system happy (an alternative is to
>>>> always create an intermediate PCollection<Row>, but that would be
>>>> computationally expensive). We might be able to find a way to model this in
>>>> Java with the to/from on the operation, however I suspect it would be
>>>> difficult and a lot of work.

Re: [DISCUSS] Portability representation of schemas

2019-09-03 Thread Gleb Kanterov
Recently there was a pull request (that was reverted) for adding portable
representation of schemas. It's great to see things moving forward, but I'm
worried that it doesn't support any logical types, especially fixed bytes.
That makes runners using portable schemas unusable, for instance, when
reading certain Avro files, while it was possible before. I'm wondering if
it would be possible to include logical types into initial implementation
or add a feature flag to opt-in to portable schema representation.
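
As a concrete example of what stops working: a schema with a FixedBytes
logical type, which is how Avro "fixed" fields map into Beam schemas (a
sketch; class locations are taken from current Beam and may differ between
versions):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.schemas.logicaltypes.FixedBytes;

// A 16-byte fixed-size field, as produced by an Avro "fixed" type; the
// reverted representation had no way to carry the logical type through.
Schema schema =
    Schema.builder()
        .addField("id", FieldType.logicalType(FixedBytes.of(16)))
        .build();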

[1]: https://github.com/apache/beam/pull/8853

Gleb

On Wed, Jun 19, 2019 at 7:08 PM Kenneth Knowles  wrote:

> Seems like a practical approach to get moving on things. Just to restate
> my understanding:
>
>  - in Java it is PCollection<T> but with the row coder holding
> to/from/clazz (I'm calling it row coder because row is the binary format,
> while schemas may have multiple possible formats)
>  - in portability, the URN + payload "is" the type, so it is
> PCollection<Row>
>  - in Python it is interpreted as PCollection<Row>
>
> We may eventually hit an issue caused by the asymmetry between
> portability, the authoring SDK (in this case Java), and the external SDK
> (in this case hypothetically Python). But perhaps the asymmetry is
> natural/harmless. We may eventually want it in the general form of the
> payload carrying identifying info of what the authoring SDK is.
>
> As to attaching conversions to operations rather than coders, I'm not so
> sure. It gets at the two natures of coders:
>
> 1. URN + payload fully specify how to interpret the bytes of an element
> 2. An SDK's implementation of the coder for a URN + payload is a contract
> with DoFns authored in that SDK what SDK-specific type they will receive
>
> The idea of storing the to/from conversions on operations themselves would
> be a sort of decoupling of 1 and 2. Assuming we design something that still
> allows eliding conversions, I expect that will look nearly identical.
> Coders are representation to/from Bytes, while this conversion layer is T
> to/from representation. So I might think we can do it without adding
> anything to the model.
>
> Kenn
>
> On Tue, Jun 18, 2019 at 11:04 PM Reuven Lax  wrote:
>
>> Robert, you are correct that in principle the to/from functions are
>> needed on the operation, as that's where automatic conversion happens (in
>> Java it happens in DoFnRunner). However there are two blockers there:
>>
>> 1. As Brian mentioned, the issue in Java is that we never have
>> PCollection<Row> in this case. The source PCollection will simply be
>> PCollection<T>, where T has a schema. The to/from functions are now
>> required to interpret this PCollection. Currently we need to put it on the
>> PCollection itself to make Java's type system happy (an alternative is to
>> always create an intermediate PCollection<Row>, but that would be
>> computationally expensive). We might be able to find a way to model this in
>> Java with the to/from on the operation, however I suspect it would be
>> difficult and a lot of work.
>>
>> 2. I believe there are some cases where PTransforms access the to/from
>> functions in expand(), which is before we have an operation to attach the
>> those functions to. Again this is presumably solvable, but would require
>> design and more work.
>>
>> 3. Currently the user can call setSchema on any PCollection, and pass in
>> to/from functions there. We would have to rethink this API.
>>
>> So I think leaving it in the coder is the pragmatic approach for now,
>> though it would be interesting to see if we could solve the above issues
>> and instead automatically propagate the functions to the operation.
>>
>> I agree that we should not make these things opaque in the portable
>> representation, if only for ease of debugging. However they should not be
>> needed for cross-language calls.
>>
>> Reuven
>>
>> On Tue, Jun 18, 2019 at 5:09 AM Robert Bradshaw 
>> wrote:
>>
>>> Thanks for updating that alternative.
>>>
>>> As for the to/from functions, it does seem pragmatic to dangle them
>>> off the purely portable representation (either as a field there, or as
>>> an opaque logical type whose payload contains the to/from functions,
>>> or a separate coder that wraps the schema coder (though I can't see
>>> how the latter would work well if nesting is allowed)) until we figure
>>> out a good way to attach them to the operations themselves.
>>>
>>> On Tue, Jun 18, 2019 at 2:37 AM Brian Hulette 
>>> wrote:
>>> >
>>> > Realized I completely ignored one of your points, added another
>>> response inline.
>>> >
>>> > On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax  wrote:
>>> >> >
>>> >> > Spoke to Brian about his proposal. It is essentially this:
>>> >> >
>>> >> > We create PortableSchemaCoder, with a well-known URN. This coder is
>>> parameterized by the schema (i.e. list of field name -> field type pairs).
>>> >>
>>> >> Given that we have a field type that is (list of 

Re: Improve container support

2019-08-28 Thread Gleb Kanterov
The Google Doc doesn't seem to be shared with dev@. Can anybody double-check?

On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang  wrote:

> add dev@
>
> On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
> wrote:
>
>> Thanks for commenting and discussions.
>> I created a Google Doc for easy commenting and reviewing. From this
>> moment, all changes will be updated to the Google Doc and I will sync
>> them to the wiki after finalizing all plans.
>>
>> Thanks,
>> Hannah
>>
>> On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:
>>
>>> Hi datapls-engprod,
>>>
>>> I have a question. Do you know what it would take to create a new gcp
>>> project similar to apache-beam-testing for purposes of distributing gcr
>>> packages? We can use the same billing account.
>>>
>>> Hannah, Robert, depending on the complexity of creating another gcp
>>> project we can go with that, or simply create a new bintray account. Either
>>> way would give us a clean new project to publish artifacts.
>>>
>>> Ahmet
>>>
>>> -- Forwarded message -
>>> From: Robert Bradshaw 
>>> Date: Tue, Aug 27, 2019 at 6:48 PM
>>> Subject: Re: Improve container support
>>> To: dev 
>>>
>>>
>>> On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
>>> >
>>> > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang 
>>> wrote:
>>> >> >
>>> >> > Hi team
>>> >> >
>>> >> > I am working on improving docker container support for Beam. We
>>> would like to publish prebuilt containers for each release version and
>>> daily snapshot. Current work focuses on release images only and it would be
>>> part of the release process.
>>> >>
>>> >> This would be great!
>>> >>
>>> >> > The release images will be pushed to GCR which is publicly
>>> accessible(pullable). We will use the following locations.
>>> >> > Repository: gcr.io/beam
>>> >> > Project: apache-beam-testing
>>> >>
>>> >> Given that these are release artifacts, we should use a project with
>>> >> more restricted access than "anyone who opens a PR on github."
>>> >
>>> >
>>> > We have two options:
>>> > -  gcr.io works based on the permissions of the gcs bucket that is
>>> backing it. GCS supports bucket only permissions. These permissions needs
>>> to be explicitly granted and the service accounts used by jenkins jobs does
>>> not have these explicit permissions today.
>>> > - we can create a new project in gcr, bintray or anything else that
>>> offers the same service.
>>>
>>> I think the cleanest is to simply have a new project whose membership
>>> consists of (interested) PMC members. If we have to populate this
>>> manually I think that'd still be OK as the churn is quite low.
>>>
>>

-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-27 Thread Gleb Kanterov
Congratulations Valentyn!

On Tue, Aug 27, 2019 at 7:22 AM jincheng sun 
wrote:

> Congrats Valentyn!
>
> Best,
> Jincheng
>
> On Tue, Aug 27, 2019 at 10:37 AM Ankur Goenka  wrote:
>
>> Congratulations Valentyn!
>>
>> On Mon, Aug 26, 2019, 5:02 PM Yifan Zou  wrote:
>>
>>> Congratulations, Valentyn! Well deserved!
>>>
>>> On Mon, Aug 26, 2019 at 3:31 PM Aizhamal Nurmamat kyzy <
>>> aizha...@google.com> wrote:
>>>
 Congratulations! and thank you for your contributions, Valentyn!

 On Mon, Aug 26, 2019 at 3:26 PM Thomas Weise  wrote:

> Congrats!
>
>
> On Mon, Aug 26, 2019 at 3:22 PM Heejong Lee 
> wrote:
>
>> Congratulations! :)
>>
>> On Mon, Aug 26, 2019 at 2:44 PM Rui Wang  wrote:
>>
>>> Congratulations!
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang 
>>> wrote:
>>>
 Congratulations Valentyn, well deserved!

 On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats Valentyn!
>
> On Mon, Aug 26, 2019 at 2:32 PM Pablo Estrada 
> wrote:
>
>> Thanks Valentyn!
>>
>> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu 
>> wrote:
>>
>>> Thank you Valentyn! Congratulations!
>>>
>>> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 Hi,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Valentyn Tymofieiev

 Valentyn has made numerous contributions to Beam over the last
 several
 years (including 100+ pull requests), most recently pushing
 through
 the effort to make Beam compatible with Python 3. He is also an
 active
 participant in design discussions on the list, participates in
 release
 candidate validation, and proactively helps keep our tests
 green.

 In consideration of Valentyn's contributions, the Beam PMC
 trusts him
 with the responsibilities of a Beam committer [1].

 Thank you, Valentyn, for your contributions and looking forward
 to many more!

 Robert, on behalf of the Apache Beam PMC

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>

-- 
Cheers,
Gleb


Java serialization for coders and compatibility

2019-08-13 Thread Gleb Kanterov
I'm looking into the code of AvroCoder, and I was wondering what happens
when users upgrade Beam for streaming pipelines?

As I understand it, we should be able to deserialize a coder from a previous
Beam version. Looking into the guava vendoring, it's going to break
serialization when we switch the guava version, because the current
version is a part of the package namespace:

import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Supplier;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Suppliers;

We don't have tests for it, but probably we already broke compatibility
when we vendored guava. Can anybody clarify what would be the approach for
coders?
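
To make the failure mode concrete, a sketch of the round trip that breaks
(MyRecord stands in for any user type):

import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.SerializableUtils;

// Java serialization embeds fully-qualified class names, so any coder
// field typed against org.apache.beam.vendor.guava.v26_0_jre.* bakes the
// vendored guava version into the byte stream.
AvroCoder<MyRecord> coder = AvroCoder.of(MyRecord.class);
byte[] snapshot = SerializableUtils.serializeToByteArray(coder);

// If a later Beam relocates guava to, say, v27_0_jre, deserializing the
// old bytes fails with a ClassNotFoundException for the v26_0_jre name.
Coder<MyRecord> restored =
    (Coder<MyRecord>) SerializableUtils.deserializeFromByteArray(snapshot, "AvroCoder");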


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-13 Thread Gleb Kanterov
+1

On Tue, Aug 13, 2019 at 10:47 AM Ismaël Mejía  wrote:

> +1
> Wishing that this goes to calcite too someday (hoping that it makes
> Beam side maintenance simpler)
>
> On Tue, Aug 13, 2019 at 6:18 AM Manu Zhang 
> wrote:
> >
> > +1
> >
> > On Tue, Aug 13, 2019 at 11:55 AM Mingmin Xu  wrote:
> >>
> >> +1
> >>
> >> On Mon, Aug 12, 2019 at 8:53 PM Ryan McDowell 
> wrote:
> >>>
> >>> +1
> >>>
> >>> On Mon, Aug 12, 2019 at 8:30 PM Reza Rokni  wrote:
> 
>  +1
> 
>  On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:
> >
> > +1
> >
> > On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles 
> wrote:
> >>
> >> +1
> >>
> >> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:
> >>>
> >>> Hi Community,
> >>>
> >>> I am using this separate thread to collect votes on contributing
> Beam ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to
> Beam repo.
> >>>
> >>> There are discussions related to benefits, technical design and
> others on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note
> that this vote is not about merging the PR, which should be decided by code
> review. This vote is only to vote if Beam ZetaSQL should live in Beam repo.
> >>>
> >>> +1: Beam repo can host Beam ZetaSQL
> >>> -1: Beam repo should not host Beam ZetaSQL
> >>>
> >>> If there are more questions related to Beam ZetaSQL, please
> discuss it in [1].
> >>>
> >>> [1]:
> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
> >>> [2]: https://github.com/apache/beam/pull/9210
> >>>
> >>> -Rui
> 
> 
> 
>  --
> 
>  This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
> 
>  The above terms reflect a potential business arrangement, are
> provided solely as a basis for further discussion, and are not intended to
> be and do not constitute a legally binding obligation. No legally binding
> obligations will be created, implied, or inferred until an agreement in
> final form is executed in writing by all parties involved.
> >>
> >>
> >>
> >> --
> >> 
> >> Mingmin
>


-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer: Rui Wang

2019-08-07 Thread Gleb Kanterov
Congratulations Rui! Well done!

On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan 
wrote:

> Well done Rui!!!
>
> On Tue, Aug 6, 2019 at 15:41 Chamikara Jayalath 
> wrote:
>
>> Congrats Rui.
>>
>> On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak 
>> wrote:
>>
>>> Congrats Rui!
>>>
>>> On Tue, Aug 6, 2019 at 1:37 PM Yichi Zhang  wrote:
>>>
 Congrats Rui!

 On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Congratulations, Rui! Thank you for your contributions to Beam!
>
> On Tue, Aug 6, 2019 at 10:35 AM sridhar inuog 
> wrote:
>
>> Congratulations Rui!
>>
>> On Tue, Aug 6, 2019 at 12:09 PM Hannah Jiang 
>> wrote:
>>
>>> Congrats Rui!
>>>
>>> On Tue, Aug 6, 2019 at 9:50 AM Yifan Zou 
>>> wrote:
>>>
 Congratulations Rui!

 On Tue, Aug 6, 2019 at 9:47 AM Anton Kedin 
 wrote:

> Congrats!
>
> On Tue, Aug 6, 2019, 9:36 AM Ankur Goenka 
> wrote:
>
>> Congratulations Rui!
>> Well deserved 
>>
>> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay 
>> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Rui Wang.
>>>
>>> Rui has been an active contributor since May 2018. Rui has been
>>> very active in Beam SQL [1] and continues to help out on user@
>>> and StackOverflow. Rui is one of the top answerers for apache-beam 
>>> tag [2].
>>>
>>> In consideration of Rui's contributions, the Beam PMC trusts him
>>> with the responsibilities of a Beam committer [3].
>>>
>>> Thank you, Rui, for your contributions and looking forward to
>>> many more!
>>>
>>> Ahmet, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://github.com/apache/beam/pulls?q=is%3Apr+author%3Aamaliujia
>>> [2] https://stackoverflow.com/tags/apache-beam/topusers
>>> [3] https://beam.apache.org/contribute/become-a-committer
>>> /#an-apache-beam-committer
>>>
>>

-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-07 Thread Gleb Kanterov
Congratulations!

On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan 
wrote:

> Well done congratulations Kyle!!!
>
> On Tue, Aug 6, 2019 at 21:58 Thomas Weise  wrote:
>
>> Congrats!
>>
>> On Tue, Aug 6, 2019, 7:24 PM Reza Rokni  wrote:
>>
>>> Congratz!
>>>
>>> On Wed, 7 Aug 2019 at 06:40, Chamikara Jayalath 
>>> wrote:
>>>
 Congrats!!

 On Tue, Aug 6, 2019 at 3:33 PM Udi Meiri  wrote:

> Congrats Kyle!
>
> On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak 
> wrote:
>
>> Congratulations Kyle!
>>
>> On Tue, Aug 6, 2019 at 1:36 PM Yichi Zhang  wrote:
>>
>>> Congrats Kyle!
>>>
>>> On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
>>> aizha...@google.com> wrote:
>>>
 Thank you, Kyle! And congratulations :)

 On Tue, Aug 6, 2019 at 10:09 AM Hannah Jiang <
 hannahji...@google.com> wrote:

> Congrats Kyle!
>
> On Tue, Aug 6, 2019 at 9:52 AM David Morávek <
> david.mora...@gmail.com> wrote:
>
>> Congratulations Kyle!!
>>
>> Sent from my iPhone
>>
>> On 6 Aug 2019, at 18:47, Anton Kedin  wrote:
>>
>> Congrats!
>>
>> On Tue, Aug 6, 2019, 9:37 AM Ankur Goenka 
>> wrote:
>>
>>> Congratulations Kyle!
>>>
>>> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay 
>>> wrote:
>>>
 Hi,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Kyle Weaver.

 Kyle has been contributing to Beam for a while now. And in that
 time period Kyle got the portable spark runner feature complete 
 for batch
 processing. [1]

 In consideration of Kyle's contributions, the Beam PMC trusts
 him with the responsibilities of a Beam committer [2].

 Thank you, Kyle, for your contributions and looking forward to
 many more!

 Ahmet, on behalf of the Apache Beam PMC

 [1]
 https://lists.apache.org/thread.html/c43678fc24c9a1dc9f48c51c51950aedcb9bc0fd3b633df16c3d595a@%3Cuser.beam.apache.org%3E
 [2] https://beam.apache.org/contribute/become-a-committer
 /#an-apache-beam-committer

>>>
>>>
>>> --
>>>
>>> This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>>
>>> The above terms reflect a potential business arrangement, are provided
>>> solely as a basis for further discussion, and are not intended to be and do
>>> not constitute a legally binding obligation. No legally binding obligations
>>> will be created, implied, or inferred until an agreement in final form is
>>> executed in writing by all parties involved.
>>>
>>

-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer: Jan Lukavský

2019-08-01 Thread Gleb Kanterov
Congratulations!

On Thu, Aug 1, 2019 at 3:11 PM Reza Rokni  wrote:

> Congratulations , awesome stuff !
>
> On Thu, 1 Aug 2019, 12:11 Maximilian Michels,  wrote:
>
>> Congrats, Jan! Good to see you become a committer :)
>>
>> On 01.08.19 12:37, Łukasz Gajowy wrote:
>> > Congratulations!
>> >
>> > On Thu, Aug 1, 2019 at 11:16 Robert Bradshaw  wrote:
>> >
>> > Congratulations!
>> >
>> > On Thu, Aug 1, 2019 at 9:59 AM Jan Lukavský  wrote:
>> >
>> > Thanks everyone!
>> >
>> > Looking forward to working with this great community! :-)
>> >
>> > Cheers,
>> >
>> >  Jan
>> >
>> > On 8/1/19 12:18 AM, Rui Wang wrote:
>> > > Congratulations!
>> > >
>> > > -Rui
>> > >
>> > > On Wed, Jul 31, 2019 at 10:51 AM Robin Qiu <robi...@google.com> wrote:
>> > >
>> > > Congrats!
>> > >
>> > > On Wed, Jul 31, 2019 at 10:31 AM Aizhamal Nurmamat kyzy
>> > > <aizha...@apache.org> wrote:
>> > >
>> > > Congratulations, Jan! Thank you for your contributions!
>> > >
>> > > On Wed, Jul 31, 2019 at 10:04 AM Tanay Tummalapalli
>> > > <ttanay...@gmail.com> wrote:
>> > >
>> > > Congratulations!
>> > >
>> > > On Wed, Jul 31, 2019 at 10:05 PM Ahmet Altay <al...@google.com> wrote:
>> > >
>> > > Congratulations Jan! Thank you for your contributions!
>> > >
>> > > On Wed, Jul 31, 2019 at 2:30 AM Ankur Goenka <goe...@google.com> wrote:
>> > >
>> > > Congratulations Jan!
>> > >
>> > > On Wed, Jul 31, 2019, 1:23 AM David Morávek wrote:
>> > >
>> > > Congratulations Jan, well deserved! ;)
>> > >
>> > > D.
>> > >
>> > > On Wed, Jul 31, 2019 at 10:17 AM Ryan Skraba wrote:
>> > >
>> > > Congratulations Jan!
>> > >
>> > > On Wed, Jul 31, 2019 at 10:10 AM Ismaël Mejía wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > Please join me and the rest of the Beam PMC in welcoming a new
>> > > > committer: Jan Lukavský.
>> > > >
>> > > > Jan has been contributing to Beam for a while, he was part of the team
>> > > > that contributed the Euphoria DSL extension, and he has done
>> > > > interesting improvements for the Spark and Direct runner. He has also
>> > > > been active in the community discussions around the Beam model and
>> > > > other subjects.
>> > > >
>> > > > In consideration of Jan's contributions, the Beam PMC trusts him with
>> > > > the responsibilities of a Beam committer [1].
>> > > >
>> > > > Thank you, Jan, for your contributions and looking forward to many more!
>> > > >
>> > > > Ismaël, on behalf of the Apache Beam PMC
>> > > >
>> > > > [1] https://beam.apache.org/committer/committer
>> > >
>>
>>

-- 
Cheers,
Gleb


Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Gleb Kanterov
What is the long-term plan for org.apache.beam.sdk.io.Read? Is it going
away in favor of SDF, or are we always going to have both?

I was looking into AvroIO.read and AvroIO.readAll, both of them
use AvroSource. AvroIO.readAll is using SDF, and it's implemented with
ReadAllViaFileBasedSource that takes AvroSource as a parameter. Looking at
ReadAllViaFileBasedSource, I find it not necessary to use Source; it
should be enough to have something like (KV<ReadableFile, OffsetRange>,
OutputReceiver<T>), as we have discussed in this thread, and that should be
fine for SMB as well. It would require duplicating code from AvroSource,
but in the end, I don't see it as a problem if AvroSource is going away.
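
In other words, something like the following would be enough (a
hypothetical interface, not an existing Beam API):

import java.io.IOException;
import java.io.Serializable;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Reads the records in one byte range of one file, which is all that
// ReadFileRangesFn needs from a source, without the Source machinery.
interface FileRangeReader<T> extends Serializable {
  void read(KV<FileIO.ReadableFile, OffsetRange> range, DoFn.OutputReceiver<T> output)
      throws IOException;
}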

I'm attaching a small diagram I put for myself to better understand the
code.

AvroIO.readAll :: PTransform<PCollection<String>, PCollection<T>> ->

FileIO.matchAll :: PTransform<PCollection<String>,
PCollection<MatchResult.Metadata>>
FileIO.readMatches :: PTransform<PCollection<MatchResult.Metadata>,
PCollection<ReadableFile>>
AvroIO.readFiles :: PTransform<PCollection<ReadableFile>,
PCollection<T>> ->

ReadAllViaFileBasedSource :: PTransform<PCollection<ReadableFile>,
PCollection<T>> ->

ParDo.of(SplitIntoRangesFn :: DoFn<ReadableFile, KV<ReadableFile,
OffsetRange>>) (splittable do fn)

Reshuffle.viaRandomKey()

ParDo.of(ReadFileRangesFn(createSource) :: DoFn<KV<ReadableFile, OffsetRange>, T>) where

createSource :: String -> FileBasedSource<T>

createSource = AvroSource


AvroIO.read without getHintMatchedManyFiles() :: PTransform<PBegin, PCollection<T>> ->

Read.Bounded.from(createSource) where

createSource :: String -> FileBasedSource<T>

createSource = AvroSource


Gleb


On Thu, Jul 25, 2019 at 2:41 PM Robert Bradshaw  wrote:

> On Thu, Jul 25, 2019 at 12:35 AM Kenneth Knowles  wrote:
> >
> > From the peanut gallery, keeping a separate implementation for SMB seems
> fine. Dependencies are serious liabilities for both upstream and
> downstream. It seems like the reuse angle is generating extra work, and
> potentially making already-complex implementations more complex, instead of
> helping things.
>
> +1
>
> To be clear, what I care about is that WriteFiles(X) and
> WriteSmbFiles(X) can share the same X, for X in {Avro, Parquet, Text,
> TFRecord, ...}. In other words composability of the API (vs. manually
> filling out the matrix). If WriteFiles and WriteSmbFiles find
> opportunities for (easy, clean) implementation sharing, that'd be
> nice, but not the primary goal.
>
> (Similarly for reading, though that seems less obvious. Certainly
> whatever T is useful for ReadSmb(T) could be useful for a
> (non-liquid-sharding) ReadAll(T) however.)
>
> > On Wed, Jul 24, 2019 at 11:59 AM Neville Li 
> wrote:
> >>
> >> I spoke too soon. Turns out for unsharded writes, numShards can't be
> determined until the last finalize transform, which is again different from
> the current SMB proposal (static number of buckets & shards).
> >> I'll end up with more code specialized for SMB in order to generalize
> existing sink code, which I think we all want to avoid.
> >>
> >> Seems the only option is duplicating some logic like temp file
> handling, which is exactly what we did in the original PR.
> >> I can reuse Compression & Sink for file level writes but that seems
> about the most I can reuse right now.
> >>
> >> On Tue, Jul 23, 2019 at 6:36 PM Neville Li 
> wrote:
> >>>
> >>> So I spent one afternoon trying some ideas for reusing the last few
> transforms of WriteFiles.
> >>>
> >>> WriteShardsIntoTempFilesFn extends DoFn<KV<ShardedKey<Integer>,
> Iterable<UserT>>, FileResult<DestinationT>>
> >>> => GatherResults<ResultT> extends PTransform<PCollection<ResultT>,
> PCollection<List<ResultT>>>
> >>> => FinalizeTempFileBundles extends
> PTransform<PCollection<List<FileResult<DestinationT>>>,
> WriteFilesResult<DestinationT>>
> >>>
> >>> I replaced FileResult with KV
> so I can use pre-computed SMB destination file names for the transforms.
> >>> I'm also thinking of parameterizing ShardedKey for SMB's
> bucket/shard to reuse WriteShardsIntoTempFilesFn. These transforms are
> private and easy to change/pull out.
> >>>
> >>> OTOH they are somewhat coupled with the package private
> {Avro,Text,TFRecord}Sink and their WriteOperation impl (where the bulk of
> temp file handing logic lives). Might be hard to decouple either modifying
> existing code or creating new transforms, unless if we re-write most of
> FileBasedSink from scratch.
> >>>
> >>> Let me know if I'm on the wrong track.
> >>>
> >>> WIP Branch https://github.com/spotify/beam/tree/neville/write-files
> >>>
> >>> On Tue, Jul 23, 2019 at 4:22 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
> 
> 
> 
>  On Mon, Jul 22, 2019 at 1:41 PM Robert Bradshaw 
> wrote:
> >
> > On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov <
> kirpic...@google.com> wrote:
> > >
> > > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw <
> rober...@google.com> wrote:
> > >>
> > >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li 
> wrote:
> > >> >
> > >> > Thanks Robert. Agree with the FileIO point. I'll look into it
> and see what needs to be done.
> > >> >
> > >> > Eugene pointed out that we shouldn't build on
> FileBased{Source,Sink}. So for writes I'll probably build on top of
> WriteFiles.
> > >>
> > >> Meaning it could be parameterized by FileIO.Sink, right?
> > >>
> > >>
> 

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-17 Thread Gleb Kanterov
>
> Suppose one assigns a sharding function to a PCollection. Is it lazy,
> or does it induce a reshuffle right at that point? In either case,
> once the ShardingFn has been applied, how long does it remain in
> effect? Does it prohibit the runner (or user) from doing subsequent
> resharding (including dynamic load balancing)? What happens when one
> has a DoFn that changes the value? (Including the DoFns in our sinks
> that assign random keys.)


What if we reasoned about sharding in the same way as we reason about
timestamps?

Please correct me if I am wrong: as far as I know, in Beam, timestamps
exist for each element. You can get the timestamp by using
Reify.timestamps. If there are timestamped values, and they go through a
ParDo, timestamps are preserved. We can think of the same with sharding,
where Reify.shards would be PTransform<PCollection<T>,
PCollection<ShardedValue<T>>> and ShardedValue<T> would contain a shard
and a grouping key. A ParDo wouldn't change sharding and would propagate
the ShardingFn. CoGroupByKey on such PCollections would reify the grouping
key and do a regular CoGroupByKey, or be rewritten to a regular ParDo if
the sharding of the inputs is compatible.

As you mentioned, it requires dynamic work rebalancing to preserve
sharding. What if we do dynamic work rebalancing for each shard
independently, as, I guess, it's done today for fixed windows? When we do
a split, we would split one shard into two. It should be possible to do
this consistently if values within buckets are sorted; in this case, we
would split ranges of possible values.
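
A sketch of the shape I have in mind (everything below is hypothetical;
none of these classes or methods exist in Beam today):

// Pairs an element with the shard it was assigned to and the grouping
// key the ShardingFn derived, mirroring how Reify.timestamps pairs
// elements with their timestamps.
public abstract class ShardedValue<T> {
  public abstract T getValue();
  public abstract int getShard();
  public abstract byte[] getEncodedGroupingKey();
}

// Hypothetical Reify.shards() :: PTransform<PCollection<T>,
// PCollection<ShardedValue<T>>>; sharding metadata then flows through
// ParDo untouched, the same way timestamps do. Here `lines` is an
// upstream PCollection<String>.
PCollection<ShardedValue<String>> sharded = lines.apply(Reify.shards());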






On Wed, Jul 17, 2019 at 6:37 PM Robert Bradshaw  wrote:

> On Wed, Jul 17, 2019 at 4:26 PM Gleb Kanterov  wrote:
> >
> > I find there is an interesting point in the comments brought by Ahmed
> Eleryan. Similar to WindowFn, having a concept of ShardingFn, that enables
> users to implement a class for sharding data. Each Beam node can have
> ShardingFn set, similar to WindowFn (or WindowingStrategy). Sinks and
> sources are aware of that and preserve this information. Using that it's
> possible to do optimization on Beam graph, removing redundant CoGroupByKey,
> and it would be transparent to users.
> >
> > It feels like a nice addition to the Beam model, or possibly we can
> implement it using existing windowing mechanics. There are people on the
> list with strong experience in the area, I'm wondering what do you think?
>
> I've actually thought about this some, though it's been quite a while.
> At the time it seemed hard to work it into a cohesive part of the
> model (even ignoring the fact that sharding is primarily an execution,
> rather than logical, property).
>
> Suppose one assigns a sharding function to a PCollection. Is it lazy,
> or does it induce a reshuffle right at that point? In either case,
> once the ShardingFn has been applied, how long does it remain in
> effect? Does it prohibit the runner (or user) from doing subsequent
> resharding (including dynamic load balancing)? What happens when one
> has a DoFn that changes the value? (Including the DoFns in our sinks
> that assign random keys.)
>
> Right now one can get most of the semantics of sharding by keying by
> the shard id and doing a GBK, where the resulting value set (which is
> allowed to be arbitrarily big) is the (indivisible) shard (e.g. for
> writing to a single file.)
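>
> (A minimal sketch of that pattern, assuming a user-supplied shardId
> function:)
>
>   PCollection<KV<Integer, Iterable<T>>> shards =
>       input
>           .apply(WithKeys.of((T element) -> shardId(element))
>               .withKeyType(TypeDescriptors.integers()))
>           .apply(GroupByKey.create());
>   // Each grouped value set is one indivisible shard, e.g. one output file.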
>
> I think sharding (like ordering, the two are quite related) is a
> property that a PCollection can have, and could be leveraged by the
> optimizer, but it's difficult to see how it's propagated through
> transforms. The most sane way to reason about it IMHO is similar to
> sink triggers, where one specifies that one wants a sharding at some
> point, and the runner arranges things upstream such that it is so, and
> some operations can declare that they happen to produce data sharded
> in some way (though again, PCollection to PCollection one needs a
> consistent notion of key to have a consistent notion of sharding).
>
> > Gleb
> >
> > On Tue, Jul 16, 2019 at 11:34 PM Eugene Kirpichov 
> wrote:
> >>
> >> I'd like to reiterate the request to not build anything on top of
> FileBasedSource/Reader.
> >> If the design requires having some interface for representing a
> function from a filename to a stream of records, better introduce a new
> interface for that.
> >> If it requires interoperability with other IOs that read files, better
> change them to use the new interface.
> >>
> >> On Tue, Jul 16, 2019 at 9:08 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>>
> >>> Thanks this clarifies a lot.
> >>>
> >>> For writer, I think it's great if you can utilize existing FileIO.Sink
> implementations even if you have to reimplement some of the logic (for

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-17 Thread Gleb Kanterov
*Note on reader block/offset/split requirement*
>>>
>>>- Because of the merge sort, we can't split or offset seek a bucket
>>>file. Without persisting the offset index of a key group somewhere,
>>>we can't efficiently skip to a key group without exhausting the previous
>>>ones. Furthermore, we need to merge-sort and align keys from multiple
>>>sources (see the sketch below), which may not have the same key
>>>distribution. It might be possible to binary search for matching keys,
>>>but that's extra complication. IMO the
>>>reader work distribution is better solved by better bucket/shard strategy
>>>in upstream writer.
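>>>
>>> A rough sketch of that merge-align step (Record, key(), and emit() are
>>> hypothetical names; this is the classic k-way merge over pre-sorted
>>> bucket iterators, using Guava's PeekingIterator):
>>>
>>>   PriorityQueue<PeekingIterator<Record>> heap =
>>>       new PriorityQueue<>(
>>>           Comparator.comparing((PeekingIterator<Record> it) -> it.peek().key()));
>>>   for (Iterator<Record> bucket : sortedBuckets) {
>>>     if (bucket.hasNext()) {
>>>       heap.add(Iterators.peekingIterator(bucket));
>>>     }
>>>   }
>>>   while (!heap.isEmpty()) {
>>>     PeekingIterator<Record> it = heap.poll();
>>>     emit(it.next());                 // records come out globally key-sorted
>>>     if (it.hasNext()) {
>>>       heap.add(it);                  // re-insert with its new head element
>>>     }
>>>   }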
>>>
>>> *References*
>>>
>>>1. ReadMatches extends PTransform<PCollection<MatchResult.Metadata>,
>>>PCollection<ReadableFile>>
>>>2. ReadAllViaFileBasedSource<T> extends
>>>PTransform<PCollection<ReadableFile>, PCollection<T>>
>>>3. WriteFiles<UserT, DestinationT, OutputT> extends
>>>PTransform<PCollection<UserT>, WriteFilesResult<DestinationT>>
>>>4. WriteShardsIntoTempFilesFn extends DoFn<KV<ShardedKey<Integer>,
>>>Iterable<UserT>>, FileResult<DestinationT>>
>>>5. FinalizeTempFileBundles extends PTransform<
>>>PCollection<List<FileResult<DestinationT>>>, WriteFilesResult<DestinationT>>
>>>
>>>
>>> On Tue, Jul 16, 2019 at 5:15 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> On Mon, Jul 15, 2019 at 7:03 PM Eugene Kirpichov 
>>>> wrote:
>>>> >
>>>> > Quick note: I didn't look through the document, but please do not
>>>> build on either FileBasedSink or FileBasedReader. They are both remnants of
>>>> the old, non-composable IO world; and in fact much of the composable IO
>>>> work emerged from frustration with their limitations and recognizing that
>>>> many other IOs were suffering from the same limitations.
>>>> > Instead of FileBasedSink, build on FileIO.write; instead of
>>>> FileBasedReader, build on FileIO.read.
>>>>
>>>> +1
>>>>
>>>> I think the sink could be written atop FileIO.write, possibly using
>>>> dynamic destinations. At the very least the FileSink interface, which
>>>> handles the details of writing a single shard, would be an ideal way
>>>> to parameterize an SMB sink. It seems that none of our existing IOs
>>>> (publically?) expose FileSink implementations.
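>>>>
>>>> A rough sketch of that parameterization (SortedBucketFileWriter is a
>>>> hypothetical name; FileIO.Sink's open/write/flush is the real interface):
>>>>
>>>>   class SortedBucketFileWriter<T> {
>>>>     private final FileIO.Sink<T> sink;  // e.g. TextIO.sink(), AvroIO.sink(...)
>>>>
>>>>     SortedBucketFileWriter(FileIO.Sink<T> sink) {
>>>>       this.sink = sink;
>>>>     }
>>>>
>>>>     // Writes one already-sorted bucket to a single file.
>>>>     void writeBucket(WritableByteChannel channel, Iterable<T> sortedRecords)
>>>>         throws IOException {
>>>>       sink.open(channel);
>>>>       for (T record : sortedRecords) {
>>>>         sink.write(record);
>>>>       }
>>>>       sink.flush();
>>>>     }
>>>>   }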
>>>>
>>>> FileIO.read is not flexible enough to do the merging. Eugene, is there
>>>> a composable analogue to FileSink, for sources, i.e. something that
>>>> can turn a file handle (possibly with offsets) into a set of records
>>>> other than FileBasedReader?
>>>>
>>>> > On Mon, Jul 15, 2019 at 9:01 AM Gleb Kanterov 
>>>> wrote:
>>>> >>
>>>> >> I share the same concern with Robert regarding re-implementing parts
>>>> of IO. At the same time, in the past, I worked on internal libraries that
>>>> try to re-use code from existing IO, and it's hardly possible because it
>>>> feels like it wasn't designed for re-use. There are a lot of classes that
>>>> are nested (non-static) or non-public. I can understand why they were made
>>>> non-public, it's a hard abstraction to design well and keep compatibility.
>>>> As Neville mentioned, decoupling readers and writers would not only benefit
>>>> for this proposal but for any other use-case that has to deal with
>>>> low-level API such as FileSystem API, that is hardly possible today without
>>>> copy-pasting,
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Mon, Jul 15, 2019 at 5:05 PM Neville Li 
>>>> wrote:
>>>> >>>
>>>> >>> Re: avoiding mirroring IO functionality, what about:
>>>> >>>
>>>> >>> - Decouple the nested FileBasedSink.Writer and
>>>> FileBasedSource.FileBasedReader, make them top level and remove references
>>>> to parent classes.
>>>> >>> - Simplify the interfaces, while maintaining support for
>>>> block/offset read & sequential write.
>>>> >>> - As a bonus, the refactored IO classes can be used standalone in
>>>> case when the user wants to perform custom IO in a DoFn, i.e. a
>>>> PTransform, PCollection>>. Today
>>>> this requires a lot of copy-pasted Avro boilerplate.
>>>> >>> - For compatibility, we can delegate to the new classes from the
&

Re: [ANNOUNCE] New committer: Robert Burke

2019-07-17 Thread Gleb Kanterov
Congratulations, Robert!

On Wed, Jul 17, 2019 at 1:50 PM Robert Bradshaw  wrote:

> Congratulations!
>
> On Wed, Jul 17, 2019, 12:56 PM Katarzyna Kucharczyk <
> ka.kucharc...@gmail.com> wrote:
>
>> Congratulations! :)
>>
>> On Wed, Jul 17, 2019 at 12:46 PM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>>
>>> Congratulations, Robert! :)
>>>
>>> On Wed, Jul 17, 2019 at 12:45 PM Łukasz Gajowy 
>>> wrote:
>>>
 Congratulations! :)

 śr., 17 lip 2019 o 04:30 Rakesh Kumar 
 napisał(a):

> Congrats Rob!!!
>
> On Tue, Jul 16, 2019 at 10:24 AM Ahmet Altay  wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Robert Burke.
>>
>> Robert has been contributing to Beam and actively involved in the
>> community for over a year. He has been actively working on Go SDK, 
>> helping
>> users, and making it easier for others to contribute [1].
>>
>> In consideration of Robert's contributions, the Beam PMC trusts him
>> with the responsibilities of a Beam committer [2].
>>
>> Thank you, Robert, for your contributions and looking forward to many
>> more!
>>
>> Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
>> [2] https://beam.apache.org/contribute/become-a-committer
>> /#an-apache-beam-committer
>>
>
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>

-- 
Cheers,
Gleb


Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Gleb Kanterov
I share Robert's concern regarding re-implementing parts of IO.
At the same time, in the past, I worked on internal libraries that try to
re-use code from existing IOs, and it's hardly possible because the code feels
like it wasn't designed for re-use. There are a lot of classes that are
nested (non-static) or non-public. I can understand why they were made
non-public; it's a hard abstraction to design well while keeping
compatibility. As Neville mentioned, decoupling readers and writers would
benefit not only this proposal but any other use case that has to deal with a
low-level API such as the FileSystem API, which is hardly possible today
without copy-pasting.

On Mon, Jul 15, 2019 at 5:05 PM Neville Li  wrote:

> Re: avoiding mirroring IO functionality, what about:
>
> - Decouple the nested FileBasedSink.Writer and
> FileBasedSource.FileBasedReader, make them top level and remove references
> to parent classes.
> - Simplify the interfaces, while maintaining support for block/offset read
> & sequential write.
> - As a bonus, the refactored IO classes can be used standalone in case
> when the user wants to perform custom IO in a DoFn, i.e. a
> PTransform, PCollection>>. Today
> this requires a lot of copy-pasted Avro boilerplate.
> - For compatibility, we can delegate to the new classes from the old ones
> and remove them in the next breaking release.
>
> Re: WriteFiles logic, I'm not sure about generalizing it, but what about
> splitting the part handling writing temp files into a new
> PTransform>>,
> PCollection>>? That splits the bucket-shard
> logic from actual file IO.
>
> On Mon, Jul 15, 2019 at 10:27 AM Robert Bradshaw 
> wrote:
>
>> I agree that generalizing the existing FileIO may not be the right
>> path forward, and I'd only make their innards public with great care.
>> (Would this be used like like
>> SmbSink(MyFileIO.sink(parameters).getWriter[Factory]())?) SMB is a bit
>> unique that the source and sink are much more coupled than other
>> sources and sinks (which happen to be completely independent, if
>> complementary implementations, whereas SMB attempts to be a kind of
>> pipe where one half is instanciated in each pipeline).
>>
>> In short, an SMB source/sink that is parameterized by an arbitrary,
>> existing IO would be ideal (but possibly not feasible (per existing
>> prioritizations)), or an SMB source/sink that works as a pair. What
>> I'd like to avoid is a set of parallel SMB IO classes that (partially,
>> and incompletely) mirror the existing IO ones (from an API
>> perspective--how much implementation it makes sense to share is an
>> orthogonal issue that I'm sure can be worked out.)
>>
>> On Mon, Jul 15, 2019 at 4:18 PM Neville Li  wrote:
>> >
>> > Hi Robert,
>> >
>> > I agree, it'd be nice to reuse FileIO logic of different file types.
>> But given the current code structure of FileIO & scope of the change, I
>> feel it's better left for future refactor PRs.
>> >
>> > Some thoughts:
>> > - SMB file operation is simple single file sequential reads/writes,
>> which already exists as Writer & FileBasedReader but are private inner
>> classes, and have references to the parent Sink/Source instance.
>> > - The readers also have extra offset/split logic but that can be worked
>> around.
>> > - It'll be nice to not duplicate temp->destination file logic but again
>> WriteFiles is assuming a single integer shard key, so it'll take some
>> refactoring to reuse it.
>> >
>> > All of these can be done in backwards compatible way. OTOH generalizing
>> the existing components too much (esp. WriteFiles, which is already
>> complex) might lead to two logic paths, one specialized for the SMB case.
>> It might be easier to decouple some of them for better reuse. But again I
>> feel it's a separate discussion.
>> >
>> > On Mon, Jul 15, 2019 at 9:45 AM Claire McGinty <
>> claire.d.mcgi...@gmail.com> wrote:
>> >>
>> >> Thanks Robert!
>> >>
>> >> We'd definitely like to be able to re-use existing I/O components--for
>> example the Writer/FileBasedReader (since they
>> operate on a WritableByteChannel/ReadableByteChannel, which is the level of
>> granularity we need) but the Writers, at least, seem to be mostly
>> private-access. Do you foresee them being made public at any point?
>> >>
>> >> - Claire
>> >>
>> >> On Mon, Jul 15, 2019 at 9:31 AM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> I left some comments on the doc.
>> >>>
>> >>> I think the general idea is sound, but one thing that worries me is
>> >>> the introduction of a parallel set of IOs that mirrors the (existing)
>> >>> FileIOs. I would suggest either (1) incorporate this functionality
>> >>> into the generic FileIO infrastructure, or let it be parameterized by
>> >>> arbitrary IO (which I'm not sure is possible, especially for the Read
>> >>> side (and better would be the capability of supporting arbitrary
>> >>> sources, aka an optional "as-sharded-source" operation that returns a
>> >>> PTransform<..., KV>> where the iterable is

Re: [ANNOUNCE] New committer: Mikhail Gryzykhin

2019-06-25 Thread Gleb Kanterov
Congratulations!

On Tue, Jun 25, 2019 at 2:03 AM Connell O'Callaghan 
wrote:

> Thomas thank you for sharing this
>
> Congratulations on this Mikhail!!!
>
> On Mon, Jun 24, 2019 at 3:19 PM Kai Jiang  wrote:
>
>> Congrats!
>>
>> On Mon, Jun 24, 2019 at 1:46 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats!!
>>>
>>> On Mon, Jun 24, 2019 at 11:12 AM Mikhail Gryzykhin 
>>> wrote:
>>>
 Thank you everyone.

 On Mon, Jun 24, 2019 at 2:28 AM Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Congrats Misha!
>
> On Mon, Jun 24, 2019 at 11:23 Łukasz Gajowy 
> wrote:
>
>> Congratulations Mikhail!
>>
>> pt., 21 cze 2019 o 22:09 Ruoyun Huang  napisał(a):
>>
>>> Congratulations! Mikhail!
>>>
>>>
>>> On Fri, Jun 21, 2019 at 1:00 PM Yichi Zhang 
>>> wrote:
>>>
 Congrats!

 On Fri, Jun 21, 2019 at 11:55 AM Tanay Tummalapalli <
 ttanay...@gmail.com> wrote:

> Congratulations!
>
> On Fri, Jun 21, 2019 at 10:35 PM Rui Wang 
> wrote:
>
>> Congrats!
>>
>>
>> -Rui
>>
>> On Fri, Jun 21, 2019 at 9:58 AM Robin Qiu 
>> wrote:
>>
>>> Congrats, Mikhail!
>>>
>>> On Fri, Jun 21, 2019 at 9:12 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Congrats, Mikhail!

 On 21 Jun 2019, at 18:01, Anton Kedin  wrote:

 Congrats!

 On Fri, Jun 21, 2019 at 3:55 AM Reza Rokni 
 wrote:

> Congratulations!
>
> On Fri, 21 Jun 2019, 12:37 Robert Burke, 
> wrote:
>
>> Congrats
>>
>> On Fri, Jun 21, 2019, 12:29 PM Thomas Weise 
>> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a
>>> new committer: Mikhail Gryzykhin.
>>>
>>> Mikhail has been contributing to Beam and actively involved
>>> in the community for over a year. He developed the community 
>>> build
>>> dashboard [1] and added substantial improvements to our build
>>> infrastructure. Mikhail's work also covers metrics, contributor
>>> documentation, development process improvements and other areas.
>>>
>>> In consideration of Mikhail's contributions, the Beam PMC
>>> trusts him with the responsibilities of a Beam committer [2].
>>>
>>> Thank you, Mikhail, for your contributions and looking
>>> forward to many more!
>>>
>>> Thomas, on behalf of the Apache Beam PMC
>>>
>>> [1] https://s.apache.org/beam-community-metrics
>>> 
>>> [2]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>> 
>>>
>>>

>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>

-- 
Cheers,
Gleb


Re: Contributor Registration

2019-06-20 Thread Gleb Kanterov
Welcome Matt!

On Thu, Jun 20, 2019 at 11:09 AM Aizhamal Nurmamat kyzy 
wrote:

> Welcome Matt!
>
> On Thu, Jun 20, 2019 at 11:06 AM Robert Bradshaw 
> wrote:
>
>> Welcome! I added you to the contributors group.
>>
>> On Thu, Jun 20, 2019 at 11:03 AM Matt Helm  wrote:
>> >
>> > Hi Beam community,
>> >
>> > I'm Matt Helm, a Data Engineer at Shopify. I'm based in Vancouver,
>> Canada. As part of Beam Summit I'm looking to please start taking issues
>> from Jira. My username is matthelm.
>> >
>> > Thanks,
>> > Matt
>>
>

-- 
Cheers,
Gleb


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-06-10 Thread Gleb Kanterov
Congratulations!

On Fri, May 24, 2019 at 9:50 PM Joana Filipa Bernardo Carrasqueira <
joanafil...@google.com> wrote:

> Congratulations Pablo! Well deserved :D
>
> On Fri, May 17, 2019 at 3:14 PM Hannah Jiang 
> wrote:
>
>> Congratulations, Pablo, you deserve it!
>>
>> *From: *Mark Liu 
>> *Date: *Fri, May 17, 2019 at 2:45 PM
>> *To: * 
>>
>> Congratulations, Pablo!
>>>
>>> *From: *Alexey Romanenko 
>>> *Date: *Fri, May 17, 2019 at 2:12 AM
>>> *To: *dev
>>>
>>> Congratulations, Pablo!

 On 16 May 2019, at 20:38, Rui Wang  wrote:

 Congrats! Congrats! Congrats!

 -Rui

 On Thu, May 16, 2019 at 9:45 AM Udi Meiri  wrote:

> Congrats Pablo!
>
> On Thu, May 16, 2019 at 9:27 AM Thomas Weise  wrote:
>
>> Congratulations, Pablo!
>>
>>
>> On Thu, May 16, 2019 at 5:03 AM Katarzyna Kucharczyk <
>> ka.kucharc...@gmail.com> wrote:
>>
>>> Wow, great news!  Congratulations, Pablo!
>>>
>>> On Thu, May 16, 2019 at 1:28 PM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Congratulations, Pablo!

 On Thu, May 16, 2019 at 1:55 AM Rose Nguyen 
 wrote:

> Congrats, Pablo!!
>
> On Wed, May 15, 2019 at 4:43 PM Heejong Lee 
> wrote:
>
>> Congratulations!
>>
>> On Wed, May 15, 2019 at 12:24 PM Niklas Hansson <
>> niklas.sven.hans...@gmail.com> wrote:
>>
>>> Congratulations Pablo :)
>>>
>>> Den ons 15 maj 2019 kl 21:21 skrev Ruoyun Huang <
>>> ruo...@google.com>:
>>>
 Congratulations, Pablo!

 *From: *Charles Chen 
 *Date: *Wed, May 15, 2019 at 11:04 AM
 *To: *dev

 Congrats Pablo and thank you for your contributions!
>
> On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congrats, Pablo!
>>
>> On Wed, May 15, 2019 at 10:41 AM Yifan Zou <
>> yifan...@google.com> wrote:
>>
>>> Congratulations, Pablo!
>>>
>>> *From: *Maximilian Michels 
>>> *Date: *Wed, May 15, 2019 at 2:06 AM
>>> *To: * 
>>>
>>> Congrats Pablo! Thank you for your help to grow the Beam
 community!

 On 15.05.19 10:33, Tim Robertson wrote:
 > Congratulations Pablo
 >
 > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía <
 ieme...@gmail.com
 > > wrote:
 >
 > Congrats Pablo, well deserved, nice to see your work
 recognized!
 >
 > On Wed, May 15, 2019 at 9:59 AM Pei HE <
 pei...@gmail.com
 > > wrote:
 >  >
 >  > Congrats, Pablo!
 >  >
 >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
 >  > >>> ttanay.apa...@gmail.com>> wrote:
 >  > >
 >  > > Congratulations Pablo!
 >  > >
 >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
 adude3...@gmail.com
 > > wrote:
 >  > >>
 >  > >> Congrats, Pablo!
 >  > >>
 >  > >> On Wed, May 15, 2019 at 8:21 AM Connell
 O'Callaghan
 > mailto:conne...@google.com>>
 wrote:
 >  > >>>
 >  > >>> Awesome well done Pablo!!!
 >  > >>>
 >  > >>> Kenn thank you for sharing this great news
 with us!!!
 >  > >>>
 >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
 > mailto:al...@google.com>> wrote:
 >  > 
 >  >  Congratulations!
 >  > 
 >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
 > mailto:rob...@frantil.com>>
 wrote:
 >  > >
 >  > > Woohoo! Well deserved.
 >  > >
 >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
 re...@google.com
 > > wrote:
 >  > >>
 >  > >> Congratulations!
 >  > >>
 >  > >> From: Mikhail Gryzykhin <
 

Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-06 Thread Gleb Kanterov
Congratulations!

On Mon, May 6, 2019 at 2:34 PM Valentyn Tymofieiev 
wrote:

> Congrats, Udi!
>
> *From: *Thomas Weise 
> *Date: *Mon, May 6, 2019 at 7:50 AM
> *To: * 
>
> Congrats!
>>
>>
>> On Mon, May 6, 2019 at 2:25 AM Łukasz Gajowy  wrote:
>>
>>> Congrats! :)
>>>
>>> pon., 6 maj 2019 o 10:45 Alexey Romanenko 
>>> napisał(a):
>>>
 Congratulations, Udi! Thanks for your work!

 On 4 May 2019, at 01:24, Heejong Lee  wrote:

 Congratulations!

 On Fri, May 3, 2019 at 3:53 PM Reza Rokni  wrote:

> Congratulations !
>
> *From: *Reuven Lax 
> *Date: *Sat, 4 May 2019, 06:42
> *To: *dev
>
> Thank you!
>>
>> On Fri, May 3, 2019 at 3:15 PM Ankur Goenka 
>> wrote:
>>
>>> Congratulations Udi!
>>>
>>> On Fri, May 3, 2019 at 3:00 PM Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Well done Udi!!! Congratulations and thank you for your
 contributions!!!

 Kenn thank you for sharing!!!

 On Fri, May 3, 2019 at 2:49 PM Yifan Zou 
 wrote:

> Thanks Udi and congratulations!
>
> On Fri, May 3, 2019 at 2:47 PM Robin Qiu 
> wrote:
>
>> Congratulations Udi!!!
>>
>> *From: *Ruoyun Huang 
>> *Date: *Fri, May 3, 2019 at 2:39 PM
>> *To: * 
>>
>> Congratulations Udi!
>>>
>>> On Fri, May 3, 2019 at 2:30 PM Ahmet Altay 
>>> wrote:
>>>
 Congratulations, Udi!

 *From: *Kyle Weaver 
 *Date: *Fri, May 3, 2019 at 2:11 PM
 *To: * 

 Congratulations Udi! I look forward to sending you all my
> reviews for
> the next month (just kidding :)
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>
> On Fri, May 3, 2019 at 1:52 PM Charles Chen 
> wrote:
> >
> > Thank you Udi!
> >
> > On Fri, May 3, 2019, 1:51 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
> >>
> >> Congratulations, Udi! Thank you for all your
> contributions!!!
> >>
> >> From: Pablo Estrada 
> >> Date: Fri, May 3, 2019 at 1:45 PM
> >> To: dev
> >>
> >>> Thanks Udi and congrats!
> >>>
> >>> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles <
> k...@apache.org> wrote:
> 
>  Hi all,
> 
>  Please join me and the rest of the Beam PMC in welcoming
> a new committer: Udi Meiri.
> 
>  Udi has been contributing to Beam since late 2017,
> starting with HDFS support in the Python SDK and continuing with 
> a ton of
> Python work. I also will highlight his work on community-building
> infrastructure, including documentation, experiments with ways to 
> find
> reviewers for pull requests, gradle build work, analyzing and 
> reducing
> build times.
> 
>  In consideration of Udi's contributions, the Beam PMC
> trusts Udi with the responsibilities of a Beam committer [1].
> 
>  Thank you, Udi, for your contributions.
> 
>  Kenn
> 
>  [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>


-- 
Cheers,
Gleb


Re: Congrats to Beam's first 6 Google Open Source Peer Bonus recipients!

2019-05-02 Thread Gleb Kanterov
Congratulations! Well deserved!

On Thu, May 2, 2019 at 10:00 AM Ismaël Mejía  wrote:

> Congrats everyone !
>
> On Thu, May 2, 2019 at 9:14 AM Robert Bradshaw 
> wrote:
>
>> Congratulation, and thanks for all the great contributions each one of
>> you has made to Beam!
>>
>> On Thu, May 2, 2019 at 5:51 AM Ruoyun Huang  wrote:
>>
>>> Congratulations everyone!  Well deserved!
>>>
>>> On Wed, May 1, 2019 at 8:38 PM Kenneth Knowles  wrote:
>>>
 Congrats! All well deserved!

 Kenn

 On Wed, May 1, 2019 at 8:09 PM Reza Rokni  wrote:

> Congratulations!
>
> On Thu, 2 May 2019 at 10:53, Connell O'Callaghan 
> wrote:
>
>> Well done - congratulations to you all!!! Rose thank you for sharing
>> this news!!!
>>
>> On Wed, May 1, 2019 at 19:45 Rose Nguyen  wrote:
>>
>>> Matthias Baetens, Lukazs Gajowy, Suneel Marthi, Maximilian Michels,
>>> Alex Van Boxel, and Thomas Weise:
>>>
>>> Thank you for your exceptional contributions to Apache Beam. I'm
>>> looking forward to seeing this project grow and for more folks to
>>> contribute and be recognized! Everyone can read more about this award on
>>> the Google Open Source blog:
>>> https://opensource.googleblog.com/2019/04/google-open-source-peer-bonus-winners.html
>>>
>>> Cheers,
>>> --
>>> Rose Thị Nguyễn
>>>
>>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and 
> do
> not constitute a legally binding obligation. No legally binding 
> obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>

>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>

-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer announcement: Boyuan Zhang

2019-04-16 Thread Gleb Kanterov
Congratulations!

On Sat, Apr 13, 2019 at 12:53 AM Thomas Weise  wrote:

> Congrats!
>
>
> On Thu, Apr 11, 2019 at 6:03 PM Reuven Lax  wrote:
>
>> Congratulations Boyuan!
>>
>> On Thu, Apr 11, 2019 at 4:53 PM Ankur Goenka  wrote:
>>
>>> Congrats Boyuan!
>>>
>>> On Thu, Apr 11, 2019 at 4:52 PM Mark Liu  wrote:
>>>
 Congrats Boyuan!

 On Thu, Apr 11, 2019 at 9:53 AM Alexey Romanenko <
 aromanenko@gmail.com> wrote:

> > since early 2018
> > 100+ pull requests
>
> Wow, this is impressive! Great job, congrats!
>
> > On 11 Apr 2019, at 15:08, Maximilian Michels  wrote:
> >
> > Great work! Congrats.
> >
> > On 11.04.19 13:41, Robert Bradshaw wrote:
> >> Congratulations!
> >> On Thu, Apr 11, 2019 at 12:29 PM Michael Luckey <
> adude3...@gmail.com> wrote:
> >>>
> >>> Congrats and welcome, Boyuan
> >>>
> >>> On Thu, Apr 11, 2019 at 12:27 PM Tim Robertson <
> timrobertson...@gmail.com> wrote:
> 
>  Many congratulations Boyuan!
> 
>  On Thu, Apr 11, 2019 at 10:50 AM Łukasz Gajowy <
> lgaj...@apache.org> wrote:
> >
> > Congrats Boyuan! :)
> >
> > śr., 10 kwi 2019 o 23:49 Chamikara Jayalath <
> chamik...@google.com> napisał(a):
> >>
> >> Congrats Boyuan!
> >>
> >> On Wed, Apr 10, 2019 at 11:14 AM Yifan Zou 
> wrote:
> >>>
> >>> Congratulations Boyuan!
> >>>
> >>> On Wed, Apr 10, 2019 at 10:49 AM Daniel Oliveira <
> danolive...@google.com> wrote:
> 
>  Congrats Boyuan!
> 
>  On Wed, Apr 10, 2019 at 10:20 AM Rui Wang 
> wrote:
> >
> > So well deserved!
> >
> > -Rui
> >
> > On Wed, Apr 10, 2019 at 10:12 AM Pablo Estrada <
> pabl...@google.com> wrote:
> >>
> >> Well deserved : ) congrats Boyuan!
> >>
> >> On Wed, Apr 10, 2019 at 10:08 AM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
> >>>
> >>> Congratulations Boyuan!
> >>>
> >>> On Wed, Apr 10, 2019 at 9:52 AM Ruoyun Huang <
> ruo...@google.com> wrote:
> 
>  Thanks for your contributions and congratulations Boyuan!
> 
>  On Wed, Apr 10, 2019 at 9:00 AM Kenneth Knowles <
> k...@apache.org> wrote:
> >
> > Hi all,
> >
> > Please join me and the rest of the Beam PMC in welcoming
> a new committer: Boyuan Zhang.
> >
> > Boyuan has been contributing to Beam since early 2018.
> She has proposed 100+ pull requests across a wide range of topics: bug
> fixes, to integration tests, build improvements, metrics features, release
> automation. Two big picture things to highlight are building/releasing 
> Beam
> Python wheels and managing the donation of the Beam Dataflow Java Worker,
> including help with I.P. clearance.
> >
> > In consideration of Boyuan's contributions, the Beam PMC
> trusts Boyuan with the responsibilities of a Beam committer [1].
> >
> > Thank you, Boyuan, for your contributions.
> >
> > Kenn
> >
> > [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> 
> 
> 
>  --
>  
>  Ruoyun  Huang
> 
>
>

-- 
Cheers,
Gleb


Re: JDK11 support?

2019-04-10 Thread Gleb Kanterov
Is there a way to try the JDK11 harness for Dataflow without building my own
Docker image?

On Wed, Apr 10, 2019 at 2:10 AM Yi Pan  wrote:

> Hi, Pablo,
>
> Thanks for the clarification. Does that mean that there needs to be a
> separate effort to ensure KafkaIO to be Java 8 source compat and Java 11
> runtime compat?
>
> -Yi
>
> On Tue, Apr 9, 2019 at 5:03 PM Pablo Estrada  wrote:
>
>> Our work (mostly done by Michal) consisted of testing that Beam
>> artifacts built with Java 8 can run in a Java 11 environment in the
>> DirectRunner and Dataflow. We have not planned other work in this area.
>>
>> On Tue, Apr 9, 2019 at 3:00 PM Ahmet Altay  wrote:
>>
>>> Most recently +Pablo Estrada  and +Michał Walenia
>>>  were working on Java 11 related JIRAs.
>>> They may have more context.
>>>
>>> On Tue, Apr 9, 2019 at 1:22 PM Maximilian Michels 
>>> wrote:
>>>
 Hi Yi,

 That's a great question. Beam is still on Java 8 at the moment. There
 is
 a JIRA issue for making Beam compatible with Java 11:
 https://issues.apache.org/jira/browse/BEAM-2530

 As you can read in the issue, the upcoming Beam 2.12.0 has experimental
 support for Java 11. That said, there is still code that needs a major
 overhaul, e.g. automatic staging of pipeline jars
 (https://issues.apache.org/jira/browse/BEAM-5495).

 I hope we can achieve proper Java 11 compatibility in the next
 releases.
 Perhaps somebody else can give further insight into how much work is
 left.

 Thanks,
 Max

 On 09.04.19 22:07, Yi Pan wrote:
 > Hi, everyone,
 >
 > I checked out the latest Beam code and seems that the document still
 > says "JDK8". Is there any plan to officially support JDK11?
 >
 > Thanks a lot!
 >
 > -Yi

>>>

-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer announcement: Mark Liu

2019-03-25 Thread Gleb Kanterov
Congratulations!

On Mon, Mar 25, 2019 at 10:23 AM Łukasz Gajowy  wrote:

> Congrats! :)
>
>
>
> pon., 25 mar 2019 o 08:11 Aizhamal Nurmamat kyzy 
> napisał(a):
>
>> Congratulations, Mark!
>>
>> On Sun, Mar 24, 2019 at 23:18 Pablo Estrada  wrote:
>>
>>> Yeaah  Mark! : ) Congrats : D
>>>
>>> On Sun, Mar 24, 2019 at 10:32 PM Yifan Zou  wrote:
>>>
 Congratulations Mark!

 On Sun, Mar 24, 2019 at 10:25 PM Connell O'Callaghan <
 conne...@google.com> wrote:

> Well done congratulations Mark!!!
>
> On Sun, Mar 24, 2019 at 10:17 PM Robert Burke 
> wrote:
>
>> Congratulations Mark! 
>>
>> On Sun, Mar 24, 2019, 10:08 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Congratulations, Mark!
>>>
>>> Thanks for your contributions, in particular for your efforts to
>>> parallelize test execution for Python SDK and increase the speed of 
>>> Python
>>> precommit checks.
>>>
>>> On Sun, Mar 24, 2019 at 9:40 PM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Mark Liu.

 Mark has been contributing to Beam since late 2016! He has proposed
 100+ pull requests. Mark was instrumental in expanding test and
 infrastructure coverage, especially for Python. In consideration
 of Mark's contributions, the Beam PMC trusts Mark with the 
 responsibilities
 of a Beam committer [1].

 Thank you, Mark, for your contributions.

 Kenn

 [1] https://beam.apache.org/contribute/become-a-committer/
 #an-apache-beam-committer

>>> --
>>
>> *Aizhamal Nurmamat kyzy*
>>
>> Open Source Program Manager
>>
>> 646-355-9740 Mobile
>>
>> 601 North 34th Street, Seattle, WA 98103
>>
>>
>>

-- 
Cheers,
Gleb


Always get to LGTM in Committer Guide

2019-03-12 Thread Gleb Kanterov
Before pressing the merge button, I was familiarizing myself with the
committer guide [1]. It says:

> A committer (who is not the author of the code) should signal this either
by GitHub “approval” or by a comment such as “Looks good to me!” (LGTM).
Any committer can then merge the pull request. It is fine for a committer
to self-merge if another committer has reviewed the code and approved it,
just be sure to be explicit about whose job it is!

As I understand it, this says that the reviewer should be a Beam committer.
However, I remember from my personal experience and reading "An approach to
community building from Apache Beam" [2] that

> either the reviewer or the author be a committer

I'm wondering if we could rephrase our Committer Guide a bit to make this
clear.

[1]: https://beam.apache.org/contribute/committer-guide/
[2]: https://blogs.apache.org/comdev/entry/an-approach-to-community-building


Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-08 Thread Gleb Kanterov
Congratulations!

On Thu, Mar 7, 2019 at 11:52 PM Michael Luckey  wrote:

> Congrats Raghu!
>
> On Thu, Mar 7, 2019 at 8:06 PM Mark Liu  wrote:
>
>> Congrats!
>>
>> On Thu, Mar 7, 2019 at 10:45 AM Rui Wang  wrote:
>>
>>> Congrats Raghu!
>>>
>>>
>>> -Rui
>>>
>>> On Thu, Mar 7, 2019 at 10:22 AM Thomas Weise  wrote:
>>>
 Congrats!


 On Thu, Mar 7, 2019 at 10:11 AM Tim Robertson <
 timrobertson...@gmail.com> wrote:

> Congrats Raghu
>
> On Thu, Mar 7, 2019 at 7:09 PM Ahmet Altay  wrote:
>
>> Congratulations!
>>
>> On Thu, Mar 7, 2019 at 10:08 AM Ruoyun Huang 
>> wrote:
>>
>>> Thank you Raghu for your contribution!
>>>
>>>
>>>
>>> On Thu, Mar 7, 2019 at 9:58 AM Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Congratulation Raghu!!! Thank you for sharing Kenn!!!

 On Thu, Mar 7, 2019 at 9:55 AM Ismaël Mejía 
 wrote:

> Congrats !
>
> Le jeu. 7 mars 2019 à 17:09, Aizhamal Nurmamat kyzy <
> aizha...@google.com> a écrit :
>
>> Congratulations, Raghu!!!
>> On Thu, Mar 7, 2019 at 08:07 Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Raghu Angadi
>>>
>>> Raghu has been contributing to Beam since early 2016! He has
>>> continuously improved KafkaIO and supported on the user@ list
>>> but his community contributions are even more extensive, including 
>>> reviews,
>>> dev@ list discussions, improvements and ideas across SqsIO,
>>> FileIO, PubsubIO, and the Dataflow and Samza runners. In
>>> consideration of Raghu's contributions, the Beam PMC trusts Raghu 
>>> with the
>>> responsibilities of a Beam committer [1].
>>>
>>> Thank you, Raghu, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer
>>> /#an-apache-beam-committer
>>>
>>
>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>

-- 
Cheers,
Gleb


Re: Merge of vendored Guava (Some PRs need a rebase)

2019-03-05 Thread Gleb Kanterov
Ismaël, I was looking into BEAM-5723; is it possible to relocate both Guava
and the Cassandra client, instead of not relocating Guava, in BEAM-6620?

On Tue, Mar 5, 2019 at 11:23 PM Gleb Kanterov  wrote:

> I agree with the points that Kenneth has raised, mainly:
>
> > In both of the above approaches, diamond dependency problems between IOs
> are quite possible.
>
> Option 1. With having more IO-s in Beam we would start hitting diamond
> dependency problem more often. Relocating dependencies will help, but for
> this, we should avoid exposing relocated classes to end-users of IO-s. I
> can't speak about everything, but in the case of BigtableIO, it only
> exposes proto-classes that aren't part of bigtable-client-core, and
> shouldn't be relocated.
>
> Option 2. Without relocation, every other IO can be potentially broken,
> and we can solve this problem on a case-by-case basis. In maven
> world situation becomes a little better with requireUpperBoundDeps [1] from
> maven-enforcer-plugin. I don't know if there is a similar solution for
> gradle.
>
> Option 3. There is a potential future solution for dependency conflicts
> between IO-s with Java 9 JPMS [2], however, it could take a while before we
> could use it due to compatibility issues.
>
> As a short term solution, option 2 seems the best, we could go through
> known conflicts and see if it's possible to resolve them, potentially
> looking into option 1 that would take much more time.
>
> [1]:
> https://maven.apache.org/enforcer/enforcer-rules/requireUpperBoundDeps.html
> [2]: https://en.wikipedia.org/wiki/Java_Platform_Module_System
>
>
> On Mon, Mar 4, 2019 at 4:45 PM Ismaël Mejía  wrote:
>
>> That looks interesting but I am not sure if I understand correctly,
>> isn't the problem that the system API (Bigtable, Cassandra, etc)
>> exposes guava related stuff? Or in other words, wouldn't the
>> transitivie version of guava leak anyway?
>> If it does not I am pretty interested on doing this to fix the
>> Cassandra IO from leaking too.
>> https://issues.apache.org/jira/browse/BEAM-5723
>>
>> On Thu, Feb 28, 2019 at 5:17 PM Kenneth Knowles  wrote:
>> >
>> > If someone is using BigTableIO with bigtable-client-core then having
>> BigTableIO and bigtable-client-core both depend on Guava 26.0 is fine,
>> right? Specifically, a user of BigTableIO after
>> https://github.com/apache/beam/pull/7957 will still have non-vendored
>> Guava on the classpath due to the transitive deps of bigtable-client-core.
>> >
>> > In any case it seems very wrong for the Beam root project to manage the
>> version of Guava in BigTableIO since the whole point is to be compatible
>> with bigtable-client-core. Would it work to delete our pinned Guava version
>> [1] and chase down all the places it breaks, moving Guava dependencies
>> local to places where an IO or extension must use it for interop? Then you
>> don't need adapters.
>> >
>> > In both of the above approaches, diamond dependency problems between
>> IOs are quite possible.
>> >
>> > I don't know if we can do better. For example, producing a
>> bigtable-client-core where we have relocated Guava internally and using
>> that could really be an interop nightmare as things that look like the same
>> type would not be. Less likely to be broken would be bigtable-client-core
>> entirely relocated and vendored, but generally IO connectors exchange
>> objects with users and the users would have to use the relocated versions,
>> so that's gross.
>> >
>> > Kenn
>> >
>> > [1]
>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L353
>> >
>> >
>> > On Thu, Feb 28, 2019 at 2:29 AM Gleb Kanterov  wrote:
>> >>
>> >> For the past week, two independent people have asked me if I can help
>> with guava NoSuchMethodError in BigtableIO. It turns out we still have a
>> potential problem with dependencies that don't vendor guava, in this case,
>> it was bigtable-client-core that depends on guava-26.0. However, the root
>> cause of the classpath problem was in the usage of a deprecated method from
>> non-vendored guava in BigtableServiceClientImpl in the code path where we
>> integrate with bigtable client.
>> >>
>> >> I created apache/beam#7957 [1] to fix that. There few other IO-s where
>> we use non-vendored guava that we can fix using adapters.
>> >>
>> >> And there is an unknown number of conflicts between guava versions in
>> our dependencies that don't vendor it, that as I

Re: Merge of vendored Guava (Some PRs need a rebase)

2019-03-05 Thread Gleb Kanterov
I agree with the points that Kenneth has raised, mainly:

> In both of the above approaches, diamond dependency problems between IOs
are quite possible.

Option 1. With more IOs in Beam, we would start hitting the diamond
dependency problem more often. Relocating dependencies will help, but for
this, we should avoid exposing relocated classes to end users of IOs. I
can't speak about everything, but in the case of BigtableIO, it only
exposes proto classes that aren't part of bigtable-client-core and
shouldn't be relocated.

Option 2. Without relocation, every other IO can potentially be broken, and
we can solve this problem on a case-by-case basis. In the Maven world, the
situation becomes a little better with requireUpperBoundDeps [1] from the
maven-enforcer-plugin. I don't know if there is a similar solution for
Gradle.
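
For reference, a minimal enforcer configuration enabling that rule could look
like this (untested sketch; plugin version omitted):

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-enforcer-plugin</artifactId>
    <executions>
      <execution>
        <id>enforce-upper-bound-deps</id>
        <goals>
          <goal>enforce</goal>
        </goals>
        <configuration>
          <rules>
            <requireUpperBoundDeps/>
          </rules>
        </configuration>
      </execution>
    </executions>
  </plugin>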

Option 3. There is a potential future solution for dependency conflicts
between IOs with the Java 9 JPMS [2]; however, it could take a while before
we can use it due to compatibility issues.

As a short-term solution, option 2 seems best: we could go through
known conflicts and see if it's possible to resolve them, potentially
looking into option 1, which would take much more time.

[1]:
https://maven.apache.org/enforcer/enforcer-rules/requireUpperBoundDeps.html
[2]: https://en.wikipedia.org/wiki/Java_Platform_Module_System


On Mon, Mar 4, 2019 at 4:45 PM Ismaël Mejía  wrote:

> That looks interesting but I am not sure if I understand correctly,
> isn't the problem that the system API (Bigtable, Cassandra, etc)
> exposes guava related stuff? Or in other words, wouldn't the
> transitivie version of guava leak anyway?
> If it does not I am pretty interested on doing this to fix the
> Cassandra IO from leaking too.
> https://issues.apache.org/jira/browse/BEAM-5723
>
> On Thu, Feb 28, 2019 at 5:17 PM Kenneth Knowles  wrote:
> >
> > If someone is using BigTableIO with bigtable-client-core then having
> BigTableIO and bigtable-client-core both depend on Guava 26.0 is fine,
> right? Specifically, a user of BigTableIO after
> https://github.com/apache/beam/pull/7957 will still have non-vendored
> Guava on the classpath due to the transitive deps of bigtable-client-core.
> >
> > In any case it seems very wrong for the Beam root project to manage the
> version of Guava in BigTableIO since the whole point is to be compatible
> with bigtable-client-core. Would it work to delete our pinned Guava version
> [1] and chase down all the places it breaks, moving Guava dependencies
> local to places where an IO or extension must use it for interop? Then you
> don't need adapters.
> >
> > In both of the above approaches, diamond dependency problems between IOs
> are quite possible.
> >
> > I don't know if we can do better. For example, producing a
> bigtable-client-core where we have relocated Guava internally and using
> that could really be an interop nightmare as things that look like the same
> type would not be. Less likely to be broken would be bigtable-client-core
> entirely relocated and vendored, but generally IO connectors exchange
> objects with users and the users would have to use the relocated versions,
> so that's gross.
> >
> > Kenn
> >
> > [1]
> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L353
> >
> >
> > On Thu, Feb 28, 2019 at 2:29 AM Gleb Kanterov  wrote:
> >>
> >> For the past week, two independent people have asked me if I can help
> with guava NoSuchMethodError in BigtableIO. It turns out we still have a
> potential problem with dependencies that don't vendor guava, in this case,
> it was bigtable-client-core that depends on guava-26.0. However, the root
> cause of the classpath problem was in the usage of a deprecated method from
> non-vendored guava in BigtableServiceClientImpl in the code path where we
> integrate with bigtable client.
> >>
> >> I created apache/beam#7957 [1] to fix that. There few other IO-s where
> we use non-vendored guava that we can fix using adapters.
> >>
> >> And there is an unknown number of conflicts between guava versions in
> our dependencies that don't vendor it, that as I understand it, could be
> fixed by relocating them, in a similar way we do for Calcite [2].
> >>
> >> [1]: https://github.com/apache/beam/pull/7957
> >> [2]:
> https://github.com/apache/beam/blob/61de62ecbe8658de866280a8976030a0cb877041/sdks/java/extensions/sql/build.gradle#L30-L39
> >>
> >> Gleb
> >>
> >> On Sun, Jan 20, 2019 at 11:43 AM Gleb Kanterov 
> wrote:
> >>>
> >>> I didn't look deep into it, but it seems we can put
> .idea/codeInsightSettings.xml into our reposito

Re: Merge of vendored Guava (Some PRs need a rebase)

2019-02-28 Thread Gleb Kanterov
For the past week, two independent people have asked me if I can help with
guava NoSuchMethodError in BigtableIO. It turns out we still have a
potential problem with dependencies that don't vendor guava, in this case,
it was bigtable-client-core that depends on guava-26.0. However, the root
cause of the classpath problem was in the usage of a deprecated method from
non-vendored guava in BigtableServiceClientImpl in the code path where we
integrate with bigtable client.

I created apache/beam#7957 <https://github.com/apache/beam/pull/7957> [1]
to fix that. There are a few other IOs where we use non-vendored guava that we
can fix using adapters.

And there is an unknown number of conflicts between guava versions in our
dependencies that don't vendor it, which, as I understand it, could be fixed
by relocating them in a similar way to what we do for Calcite [2].

[1]: https://github.com/apache/beam/pull/7957
[2]:
https://github.com/apache/beam/blob/61de62ecbe8658de866280a8976030a0cb877041/sdks/java/extensions/sql/build.gradle#L30-L39
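
For illustration, such a relocation with the Gradle Shadow plugin could look
roughly like this (the vendored package name is made up; this is not the
actual Beam build code):

  // build.gradle (sketch)
  shadowJar {
    relocate "com.google.common",
        "org.apache.beam.vendor.bigtable.com.google.common"
    relocate "com.google.thirdparty",
        "org.apache.beam.vendor.bigtable.com.google.thirdparty"
  }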

Gleb

On Sun, Jan 20, 2019 at 11:43 AM Gleb Kanterov  wrote:

> I didn't look deep into it, but it seems we can put
> .idea/codeInsightSettings.xml into our repository where we blacklist
> packages from auto-import. See an example in
> JetBrains/kotlin/.idea/codeInsightSettings.xml
> <https://github.com/JetBrains/kotlin/blob/master/.idea/codeInsightSettings.xml>
> .
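>
> A minimal sketch of what such a file could contain (the structure mirrors
> the Kotlin repository example; the excluded package list is illustrative):
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <project version="4">
>     <component name="JavaProjectCodeInsightSettings">
>       <excluded-names>
>         <name>com.google.common</name>
>       </excluded-names>
>     </component>
>   </project>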
>
> On Sat, Jan 19, 2019 at 8:03 PM Reuven Lax  wrote:
>
>> Bad IDEs automatically generate the wrong import. I think we need to
>> automatically prevent this, otherwise the bad imports will inevitably slip
>> back in.
>>
>> Reuven
>>
>> On Tue, Jan 15, 2019 at 2:54 AM Łukasz Gajowy 
>> wrote:
>>
>>> Great news. Thanks all for this work!
>>>
>>> +1 to enforcing this on dependency level as Kenn suggested.
>>>
>>> Łukasz
>>>
>>> wt., 15 sty 2019 o 01:18 Kenneth Knowles  napisał(a):
>>>
>>>> We can enforce at the dependency level, since it is a compile error. I
>>>> think some IDEs and build tools may allow the compile-time classpath to get
>>>> polluted by transitive runtime deps, so protecting against bad imports is
>>>> also a good idea.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Jan 14, 2019 at 8:42 AM Ismaël Mejía  wrote:
>>>>
>>>>> Not yet, we need to add that too, there are still some tasks to be
>>>>> done like improve the contribution guide with this info, and document
>>>>> how to  generate a src build artifact locally since I doubt we can
>>>>> publish that into Apache for copyright reasons.
>>>>> I will message in the future for awareness for awareness when most of
>>>>> the pending tasks are finished.
>>>>>
>>>>>
>>>>> On Mon, Jan 14, 2019 at 3:51 PM Maximilian Michels 
>>>>> wrote:
>>>>> >
>>>>> > Thanks for the heads up, Ismaël! Great to see the vendored Guava
>>>>> version is used
>>>>> > everywhere now.
>>>>> >
>>>>> > Do we already have a Checkstyle rule that prevents people from using
>>>>> the
>>>>> > unvendored Guava? If not, such a rule could be useful because it's
>>>>> almost
>>>>> > inevitable that the unvedored Guava will slip back in.
>>>>> >
>>>>> > Cheers,
>>>>> > Max
>>>>> >
>>>>> > On 14.01.19 05:55, Ismaël Mejía wrote:
>>>>> > > We merged today the PR [1] that changes most of the code to use our
>>>>> > > new guava vendored dependency. In practice it means that most of
>>>>> the
>>>>> > > imports of the classes were changed from `com.google.common.` to
>>>>> > > `org.apache.beam.vendor.guava.v20_0.com.google.common.`
>>>>> > >
>>>>> > > This is a great improvement to fix a long existing problem of guava
>>>>> > > leaking through some Beam modules. This also reduces the size of
>>>>> most
>>>>> > > jars in the project because they don't need to relocate and include
>>>>> > > guava anymore, they just use the vendored dependency.
>>>>> > >
>>>>> > > Kudos to Kenn Knowles, Lukasz Cwik, Scott Wegner and the others
>>>>> that
>>>>> > > worked (are working) to make this possible.
>>>>> > >
>>>>> > > Sadly as a side effect of the merge of this PR multiple PRs were
>>>>> > > broken so please review if yours was and do a rebase and fix the
>>>>> > > imports to use the new vendored dependency. Sorry for the
>>>>> > > inconvenience. From now one all uses of guava should use the
>>>>> vendored
>>>>> > > version. Expect some updates in the docs.
>>>>> > >
>>>>> > > [1]  https://github.com/apache/beam/pull/6809
>>>>> > >
>>>>>
>>>>
>
> --
> Cheers,
> Gleb
>


-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer announcement: Michael Luckey

2019-02-27 Thread Gleb Kanterov
Congratulations and welcome!

On Wed, Feb 27, 2019 at 8:57 PM Connell O'Callaghan 
wrote:

> Excellent thank you for sharing Kenn!!!
>
> Michael congratulations for this recognition of your contributions to
> advancing BEAM
>
> On Wed, Feb 27, 2019 at 11:52 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Michael Luckey
>>
>> Michael has been contributing to Beam since early 2017. He has fixed many
>> build and developer environment issues, noted and root-caused breakages on
>> master, generously reviewed many others' changes to the build. In
>> consideration of Michael's contributions, the Beam PMC trusts Michael with
>> the responsibilities of a Beam committer [1].
>>
>> Thank you, Michael, for your contributions.
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam
>> -committer
>>
>

-- 
Cheers,
Gleb


Re: BEAM-6639. ClickHouseIOTest flakey failure failing in precommits

2019-02-09 Thread Gleb Kanterov
I'm looking into it; it seems that the previous mitigation didn't help. I
added extra logging and am going to try to reproduce the flaky failure again.
Sorry for the inconvenience; I've never experienced such problems with
testcontainers before.

On Sat, Feb 9, 2019 at 12:36 AM Alex Amato  wrote:

> https://issues.apache.org/jira/browse/BEAM-6639
>
> Noticed this failure in precommits,
>
>
> https://builds.apache.org/job/beam_PreCommit_Java_Commit/4166/testReport/junit/org.apache.beam.sdk.io.clickhouse/ClickHouseIOTest/classMethod/
>
> Any ideas what's going on with this?
>
>
>

-- 
Cheers,
Gleb


Re: ContainerLaunchException in precommit [BEAM-6497]

2019-02-05 Thread Gleb Kanterov
It seems the ClickHouse tests aren't flaky anymore; please reopen the JIRA
issue if you find them flaky again.

On Thu, Jan 31, 2019 at 4:08 PM Gleb Kanterov  wrote:

> There are two tests using testcontainers. I've noticed that in one of the
> failed builds
> <https://builds.apache.org/job/beam_PreCommit_Java_Commit/3869/> [1] only
> one of them has failed to pull docker image. I suspect that adding retries
> to container pull step can mitigate this issue. I've submitted a pull
> request apache/beam#7689 <https://github.com/apache/beam/pull/7689> [2].
>
> [1] https://builds.apache.org/job/beam_PreCommit_Java_Commit/3869/
> [2] https://github.com/apache/beam/pull/7689
>
> On Wed, Jan 30, 2019 at 12:15 AM Kenneth Knowles  wrote:
>
>> I retract my statement. I failed at web browsing.
>>
>> On Tue, Jan 29, 2019 at 3:14 PM Kenneth Knowles  wrote:
>>
>>> Version 18.10.3 no longer appears on the linked page.
>>>
>>> On Tue, Jan 29, 2019 at 3:08 PM David Rieber  wrote:
>>>
>>>> I am consistently hitting that error on this PR:
>>>> https://github.com/apache/beam/pull/7631
>>>>
>>>>
>>>> On Thu, Jan 24, 2019 at 9:14 AM Alex Amato  wrote:
>>>>
>>>>> I have just seen it randomly occur on presubmits. So I don't have a
>>>>> reliable repro, unfortunately.
>>>>> It may be a specific environmental issue to the beam1 machine the
>>>>> tests ran on?
>>>>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3722/
>>>>>
>>>>>
>>>>> On Thu, Jan 24, 2019 at 8:16 AM Gleb Kanterov 
>>>>> wrote:
>>>>>
>>>>>> I'm wondering if anybody can reproduce this issue. The build has
>>>>>> failed once because testcontainers didn't pull docker image. If we use
>>>>>> caching proxy for docker, it could be a reason for that. I didn't find 
>>>>>> any
>>>>>> similar known issue in testcontainers fixed recently, but just in case, I
>>>>>> bumped testcontainers to use a newer docker-java.
>>>>>>
>>>>>> https://github.com/apache/beam/pull/7610
>>>>>>
>>>>>> On Thu, Jan 24, 2019 at 12:27 AM Alex Amato 
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you Gleb, appreciate it.
>>>>>>>
>>>>>>> On Wed, Jan 23, 2019 at 2:40 PM Gleb Kanterov 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm looking into it. This image exists in docker hub [1], but for
>>>>>>>> some reason, it wasn't picked up.
>>>>>>>>
>>>>>>>> [1] https://hub.docker.com/r/yandex/clickhouse-server/tags
>>>>>>>>
>>>>>>>> On Wed, Jan 23, 2019 at 10:01 PM Alex Amato 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>1.
>>>>>>>>>   See: BEAM-6497
>>>>>>>>>   <https://jira.apache.org/jira/browse/BEAM-6497>
>>>>>>>>>   1. This is also causing issues blocking precommits.
>>>>>>>>>   2.
>>>>>>>>>   Seems to be caused by this failure to locate the image. Are
>>>>>>>>>  these stored somewhere or built by the build process? Any 
>>>>>>>>> idea why these
>>>>>>>>>  are failing?
>>>>>>>>>
>>>>>>>>>  Caused by: 
>>>>>>>>> com.github.dockerjava.api.exception.NotFoundException: {"message":"No 
>>>>>>>>> such image: yandex/clickhouse-server:18.10.3"}
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>> Gleb
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> Gleb
>>>>>>
>>>>>
>
> --
> Cheers,
> Gleb
>


-- 
Cheers,
Gleb


Re: ContainerLaunchException in precommit [BEAM-6497]

2019-01-31 Thread Gleb Kanterov
There are two tests using testcontainers. I've noticed that in one of the
failed builds
<https://builds.apache.org/job/beam_PreCommit_Java_Commit/3869/> [1] only
one of them failed to pull the Docker image. I suspect that adding retries
to the container pull step can mitigate this issue. I've submitted a pull
request, apache/beam#7689 <https://github.com/apache/beam/pull/7689> [2].

[1] https://builds.apache.org/job/beam_PreCommit_Java_Commit/3869/
[2] https://github.com/apache/beam/pull/7689

On Wed, Jan 30, 2019 at 12:15 AM Kenneth Knowles  wrote:

> I retract my statement. I failed at web browsing.
>
> On Tue, Jan 29, 2019 at 3:14 PM Kenneth Knowles  wrote:
>
>> Version 18.10.3 no longer appears on the linked page.
>>
>> On Tue, Jan 29, 2019 at 3:08 PM David Rieber  wrote:
>>
>>> I am consistently hitting that error on this PR:
>>> https://github.com/apache/beam/pull/7631
>>>
>>>
>>> On Thu, Jan 24, 2019 at 9:14 AM Alex Amato  wrote:
>>>
>>>> I have just seen it randomly occur on presubmits. So I don't have a
>>>> reliable repro, unfortunately.
>>>> It may be a specific environmental issue to the beam1 machine the tests
>>>> ran on?
>>>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3722/
>>>>
>>>>
>>>> On Thu, Jan 24, 2019 at 8:16 AM Gleb Kanterov  wrote:
>>>>
>>>>> I'm wondering if anybody can reproduce this issue. The build has
>>>>> failed once because testcontainers didn't pull docker image. If we use
>>>>> caching proxy for docker, it could be a reason for that. I didn't find any
>>>>> similar known issue in testcontainers fixed recently, but just in case, I
>>>>> bumped testcontainers to use a newer docker-java.
>>>>>
>>>>> https://github.com/apache/beam/pull/7610
>>>>>
>>>>> On Thu, Jan 24, 2019 at 12:27 AM Alex Amato 
>>>>> wrote:
>>>>>
>>>>>> Thank you Gleb, appreciate it.
>>>>>>
>>>>>> On Wed, Jan 23, 2019 at 2:40 PM Gleb Kanterov 
>>>>>> wrote:
>>>>>>
>>>>>>> I'm looking into it. This image exists in docker hub [1], but for
>>>>>>> some reason, it wasn't picked up.
>>>>>>>
>>>>>>> [1] https://hub.docker.com/r/yandex/clickhouse-server/tags
>>>>>>>
>>>>>>> On Wed, Jan 23, 2019 at 10:01 PM Alex Amato 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>1.
>>>>>>>>   See: BEAM-6497
>>>>>>>>   <https://jira.apache.org/jira/browse/BEAM-6497>
>>>>>>>>   1. This is also causing issues blocking precommits.
>>>>>>>>   2.
>>>>>>>>   Seems to be caused by this failure to locate the image. Are
>>>>>>>>  these stored somewhere or built by the build process? Any 
>>>>>>>> idea why these
>>>>>>>>  are failing?
>>>>>>>>
>>>>>>>>  Caused by: 
>>>>>>>> com.github.dockerjava.api.exception.NotFoundException: {"message":"No 
>>>>>>>> such image: yandex/clickhouse-server:18.10.3"}
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> Gleb
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Gleb
>>>>>
>>>>

-- 
Cheers,
Gleb


Re: Another new contributor!

2019-01-31 Thread Gleb Kanterov
Welcome! It would be interesting to hear your thoughts on how Arrow, Arrow
Flight, and Beam Portability relate; this topic was recently discussed on dev@.

On Thu, Jan 31, 2019 at 2:00 PM Ismaël Mejía  wrote:

> Welcome Brian!
> Great to have someone with Apache experience already and also with
> Arrow knowledge.
>
> On Thu, Jan 31, 2019 at 1:32 PM Maximilian Michels  wrote:
> >
> > Welcome! Arrow and Beam together would open lots of possibilities.
> Portability
> > documentation improvements would be much appreciated :)
> >
> > On 31.01.19 11:25, Łukasz Gajowy wrote:
> > > Welcome!
> > >
> > > czw., 31 sty 2019 o 02:40 Kenneth Knowles  > > > napisał(a):
> > >
> > > Welcome!
> > >
> > > On Wed, Jan 30, 2019, 17:30 Connell O'Callaghan <
> conne...@google.com
> > >  wrote:
> > >
> > > Welcome on board Brian!
> > >
> > > On Wed, Jan 30, 2019 at 5:29 PM Ahmet Altay  > > > wrote:
> > >
> > > Welcome Brian!
> > >
> > > On Wed, Jan 30, 2019 at 5:26 PM Brian Hulette <
> bhule...@google.com
> > > > wrote:
> > >
> > > Hi everyone,
> > > I'm Brian Hulette, I just switched roles at Google and
> I'll be
> > > contributing to Beam Portability as part of my new
> position. For
> > > now I'm just going through documentation and getting
> familiar
> > > with Beam from the user perspective, so if anything
> I'll just be
> > > suggesting minor edits to documentation, but I hope to
> be
> > > putting up PRs soon enough.
> > >
> > > I am also an Apache committer (bhulette is my ASF id
> and Jira
> > > username). I worked on the Arrow project's Javascript
> > > implementation in a previous job, and I'm really
> excited to look
> > > for ways to use Arrow and Beam together once I've
> ramped up.
> > >
> > > Brian
> > >
>


-- 
Cheers,
Gleb


Re: Findbugs -> Spotbugs ?

2019-01-31 Thread Gleb Kanterov
Agreed, SpotBugs brings static checks that aren't covered by error-prone;
it's a good addition. There are a few conflicts between error-prone and
SpotBugs, for instance, the approach to enum switch exhaustiveness, but this
can be configured.

On Thu, Jan 31, 2019 at 10:53 AM Ismaël Mejía  wrote:

> Not a blocker but there is not a spotbugs plugin for IntelliJ.
>
> On Thu, Jan 31, 2019 at 10:45 AM Ismaël Mejía  wrote:
> >
> > YES PLEASE let's move to spotbugs !
> > Findbugs has not had a new release in ages, and does not support Java
> > 11 either, so this will address another possible issue.
> >
> > On Thu, Jan 31, 2019 at 8:28 AM Kenneth Knowles  wrote:
> > >
> > > Over the last few hours I activated findbugs on the Dataflow Java
> worker and fixed or suppressed the errors. They started around 60 but
> fixing some uncovered others, etc. You can see the result at
> https://github.com/apache/beam/pull/7684.
> > >
> > > It has convinced me that findbugs still adds value, beyond errorprone
> and nullaway/checker/infer. Quite a few of the issues were not nullability
> related, though nullability remains the most obvious low-hanging fruit
> where a different tool would do even better than findbugs. I have not yet
> enable "non null by default" which exposes 100+ new bugs in the worker, at
> minimum.
> > >
> > > Are there known blockers for upgrading to spotbugs so we are depending
> on an active project?
> > >
> > > Kenn
>


-- 
Cheers,
Gleb


Re: [ANNOUNCE] New PMC member: Etienne Chauchot

2019-01-28 Thread Gleb Kanterov
Congratulations Etienne!


On Mon, Jan 28, 2019 at 11:36 AM Maximilian Michels  wrote:

> Congrats Etienne! It's been great to work with you.
>
> On 26.01.19 07:16, Ismaël Mejía wrote:
> > Congratulations Etienne!
> >
> > On Sat, 26 Jan 2019 at 06:42, Reuven Lax  > > wrote:
> >
> > Welcome!
> >
> > On Fri, Jan 25, 2019 at 9:30 PM Pablo Estrada  > > wrote:
> >
> > Congrats Etienne :)
> >
> > On Fri, Jan 25, 2019, 9:24 PM Trần Thành Đạt <
> dattran.v...@gmail.com
> >  wrote:
> >
> > Congratulations Etienne!
> >
> > On Sat, Jan 26, 2019 at 12:08 PM Thomas Weise <
> t...@apache.org
> > > wrote:
> >
> > Congrats, félicitations!
> >
> >
> > On Fri, Jan 25, 2019 at 3:06 PM Scott Wegner <
> sc...@apache.org
> > > wrote:
> >
> > Congrats Etienne!
> >
> > On Fri, Jan 25, 2019 at 2:34 PM Tim
> >  > > wrote:
> >
> > Congratulations Etienne!
> >
> > Tim
> >
> >  > On 25 Jan 2019, at 23:00, Kenneth Knowles
> > mailto:k...@apache.org>>
> wrote:
> >  >
> >  > Hi all,
> >  >
> >  > Please join me and the rest of the Beam PMC in
> > welcoming Etienne Chauchot to join the PMC.
> >  >
> >  > Etienne introduced himself to dev@ in
> September of
> > 2017 and over the years has contributed to Beam
> in many
> > ways - connectors, performance, design
> discussion,
> > talks, code reviews, and I'm sure I cannot list
> them
> > all. He already has a major impact on the
> direction of Beam.
> >  >
> >  > Thanks for being a part of Beam, Etienne!
> >  >
> >  > Kenn
> >
> >
> >
> > --
> >
> >
> >
> >
> > Got feedback? tinyurl.com/swegner-feedback
> > 
> >
>


-- 
Cheers,
Gleb


Re: [ANNOUNCE] New committer announcement: Gleb Kanterov

2019-01-28 Thread Gleb Kanterov
Thanks to everyone for a warm welcome! Happy to be part of the community.

On Mon, Jan 28, 2019 at 4:43 AM Thomas Weise  wrote:

> Congrats!
>
> On Sun, Jan 27, 2019 at 7:19 PM Reza Ardeshir Rokni 
> wrote:
>
>> Congratulations!
>>
>> On Sat, 26 Jan 2019 at 13:50, Pablo Estrada  wrote:
>>
>>> Congrats:)
>>>
>>> On Fri, Jan 25, 2019, 9:24 PM Trần Thành Đạt >> wrote:
>>>
>>>> Congratulations Gleb!
>>>>
>>>> On Sat, Jan 26, 2019 at 4:19 AM David Morávek 
>>>> wrote:
>>>>
>>>>> Congratulations!
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On 25 Jan 2019, at 20:41, Kai Jiang  wrote:
>>>>>
>>>>> Congratulations!
>>>>>
>>>>> On Fri, Jan 25, 2019 at 10:01 AM Rui Wang  wrote:
>>>>>
>>>>>> Congratulations!
>>>>>>
>>>>>> -Rui
>>>>>>
>>>>>> On Fri, Jan 25, 2019 at 9:58 AM Ruoyun Huang 
>>>>>> wrote:
>>>>>>
>>>>>>> Congratulations Gleb!
>>>>>>>
>>>>>>> On Fri, Jan 25, 2019 at 9:18 AM Scott Wegner 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Congrats, and welcome Gleb!
>>>>>>>>
>>>>>>>> On Fri, Jan 25, 2019 at 9:15 AM Suneel Marthi 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Congratulations
>>>>>>>>>
>>>>>>>>> On Fri, Jan 25, 2019 at 12:04 PM Anton Kedin 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Congrats!
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 25, 2019 at 8:54 AM Ismaël Mejía 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Well deserved, congratulations Gleb!
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 25, 2019 at 10:47 AM Etienne Chauchot <
>>>>>>>>>>> echauc...@apache.org> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Congrats Gleb and welcome onboard !
>>>>>>>>>>> >
>>>>>>>>>>> > Etienne
>>>>>>>>>>> >
>>>>>>>>>>> > On Friday, 25 January 2019 at 10:39 +0100, Alexey Romanenko
>>>>>>>>>>> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Congrats to Gleb and welcome on board!
>>>>>>>>>>> >
>>>>>>>>>>> > On 25 Jan 2019, at 09:22, Tim Robertson <
>>>>>>>>>>> timrobertson...@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Welcome Gleb and congratulations!
>>>>>>>>>>> >
>>>>>>>>>>> > On Fri, Jan 25, 2019 at 8:06 AM Kenneth Knowles <
>>>>>>>>>>> k...@google.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi all,
>>>>>>>>>>> >
>>>>>>>>>>> > Please join me and the rest of the Beam PMC in welcoming a new
>>>>>>>>>>> committer: Gleb Kanterov
>>>>>>>>>>> >
>>>>>>>>>>> > Gleb started contributing to Beam and quickly dove deep, doing
>>>>>>>>>>> some sensitive fixes to schemas, also general build issues, Beam 
>>>>>>>>>>> SQL, Avro,
>>>>>>>>>>> and more. In consideration of Gleb's technical and community 
>>>>>>>>>>> contributions,
>>>>>>>>>>> the Beam PMC trusts Gleb with the responsibilities of a Beam 
>>>>>>>>>>> committer [1].
>>>>>>>>>>> >
>>>>>>>>>>> > Thank you, Gleb, for your contributions.
>>>>>>>>>>> >
>>>>>>>>>>> > Kenn
>>>>>>>>>>> >
>>>>>>>>>>> > [1]
>>>>>>>>>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> 
>>>>>>> Ruoyun  Huang
>>>>>>>
>>>>>>>

-- 
Cheers,
Gleb


Re: ContainerLaunchException in precommit [BEAM-6497]

2019-01-24 Thread Gleb Kanterov
I'm wondering if anybody can reproduce this issue. The build failed
once because testcontainers didn't pull the docker image. If we use a caching
proxy for docker, that could be the reason. I didn't find any similar
known issue fixed recently in testcontainers, but just in case, I bumped
testcontainers to use a newer docker-java.

https://github.com/apache/beam/pull/7610
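
As a quick local check (a hypothetical snippet, not part of the PR), one can
verify that the exact image resolves by starting it directly with
testcontainers:

```
import org.testcontainers.containers.GenericContainer;

public class ClickHouseImagePullCheck {
  public static void main(String[] args) {
    // Starting the container forces a pull of the exact tag the precommit
    // failed to find, which helps to check whether a caching docker proxy
    // serves it correctly.
    GenericContainer<?> container =
        new GenericContainer<>("yandex/clickhouse-server:18.10.3");
    container.start();
    System.out.println("Started " + container.getDockerImageName());
    container.stop();
  }
}
```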

On Thu, Jan 24, 2019 at 12:27 AM Alex Amato  wrote:

> Thank you Gleb, appreciate it.
>
> On Wed, Jan 23, 2019 at 2:40 PM Gleb Kanterov  wrote:
>
>> I'm looking into it. This image exists in docker hub [1], but for some
>> reason, it wasn't picked up.
>>
>> [1] https://hub.docker.com/r/yandex/clickhouse-server/tags
>>
>> On Wed, Jan 23, 2019 at 10:01 PM Alex Amato  wrote:
>>
>>>
>>>1.
>>>   See: BEAM-6497 <https://jira.apache.org/jira/browse/BEAM-6497>
>>>   1. This is also causing issues blocking precommits.
>>>   2.
>>>   Seems to be caused by this failure to locate the image. Are these
>>>  stored somewhere or built by the build process? Any idea why these 
>>> are
>>>  failing?
>>>
>>>  Caused by: com.github.dockerjava.api.exception.NotFoundException: 
>>> {"message":"No such image: yandex/clickhouse-server:18.10.3"}
>>>
>>>
>>>
>>>
>>
>> --
>> Cheers,
>> Gleb
>>
>

-- 
Cheers,
Gleb


Re: ContainerLaunchException in precommit [BEAM-6497]

2019-01-23 Thread Gleb Kanterov
I'm looking into it. This image exists in docker hub [1], but for some
reason, it wasn't picked up.

[1] https://hub.docker.com/r/yandex/clickhouse-server/tags

On Wed, Jan 23, 2019 at 10:01 PM Alex Amato  wrote:

>
>1.
>   See: BEAM-6497 
>   1. This is also causing issues blocking precommits.
>   2.
>   Seems to be caused by this failure to locate the image. Are these
>  stored somewhere or built by the build process? Any idea why these 
> are
>  failing?
>
>  Caused by: com.github.dockerjava.api.exception.NotFoundException: 
> {"message":"No such image: yandex/clickhouse-server:18.10.3"}
>
>
>
>

-- 
Cheers,
Gleb


Re: Merge of vendored Guava (Some PRs need a rebase)

2019-01-20 Thread Gleb Kanterov
I didn't look deep into it, but it seems we can put
.idea/codeInsightSettings.xml into our repository and blacklist
packages from auto-import there. See the example in
JetBrains/kotlin/.idea/codeInsightSettings.xml.

On Sat, Jan 19, 2019 at 8:03 PM Reuven Lax  wrote:

> Bad IDEs automatically generate the wrong import. I think we need to
> automatically prevent this, otherwise the bad imports will inevitably slip
> back in.
>
> Reuven
>
> On Tue, Jan 15, 2019 at 2:54 AM Łukasz Gajowy 
> wrote:
>
>> Great news. Thanks all for this work!
>>
>> +1 to enforcing this on dependency level as Kenn suggested.
>>
>> Łukasz
>>
>> On Tue, 15 Jan 2019 at 01:18, Kenneth Knowles  wrote:
>>
>>> We can enforce at the dependency level, since it is a compile error. I
>>> think some IDEs and build tools may allow the compile-time classpath to get
>>> polluted by transitive runtime deps, so protecting against bad imports is
>>> also a good idea.
>>>
>>> Kenn
>>>
>>> On Mon, Jan 14, 2019 at 8:42 AM Ismaël Mejía  wrote:
>>>
 Not yet, we need to add that too; there are still some tasks to be
 done, like improving the contribution guide with this info, and documenting
 how to generate a src build artifact locally, since I doubt we can
 publish that into Apache for copyright reasons.
 I will message in the future for awareness when most of
 the pending tasks are finished.


 On Mon, Jan 14, 2019 at 3:51 PM Maximilian Michels 
 wrote:
 >
 > Thanks for the heads up, Ismaël! Great to see the vendored Guava
 version is used
 > everywhere now.
 >
 > Do we already have a Checkstyle rule that prevents people from using
 the
 > unvendored Guava? If not, such a rule could be useful because it's
 almost
 > inevitable that the unvendored Guava will slip back in.
 >
 > Cheers,
 > Max
 >
 > On 14.01.19 05:55, Ismaël Mejía wrote:
 > > We merged today the PR [1] that changes most of the code to use our
 > > new guava vendored dependency. In practice it means that most of the
 > > imports of the classes were changed from `com.google.common.` to
 > > `org.apache.beam.vendor.guava.v20_0.com.google.common.`
 > >
 > > This is a great improvement to fix a long existing problem of guava
 > > leaking through some Beam modules. This also reduces the size of
 most
 > > jars in the project because they don't need to relocate and include
 > > guava anymore, they just use the vendored dependency.
 > >
 > > Kudos to Kenn Knowles, Lukasz Cwik, Scott Wegner and the others that
 > > worked (are working) to make this possible.
 > >
 > > Sadly as a side effect of the merge of this PR multiple PRs were
 > > broken so please review if yours was and do a rebase and fix the
 > > imports to use the new vendored dependency. Sorry for the
 > > inconvenience. From now one all uses of guava should use the
 vendored
 > > version. Expect some updates in the docs.
 > >
 > > [1]  https://github.com/apache/beam/pull/6809
 > >

>>>

-- 
Cheers,
Gleb


Re: [spark runner based on dataset POC] your opinion

2019-01-18 Thread Gleb Kanterov
Agree with Kenn. It should be possible; Spark has a similar concept called
ExpressionEncoder, and I did a similar derivation using a Scala macro in
typelevel/frameless.

Most of the code in Beam is a black-box function in a ParDo, and the only way
to translate it is using `mapPartitions`. However, we could override the
behavior for known transforms from beam-java-core, for instance Group and
Select, and use FieldAccessDescriptor to push down projections. There is a
bigger opportunity for Beam SQL, which translates into transforms that fit the
Spark DataFrame model better.
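
To make the `mapPartitions` point and the bytes round trip discussed below
concrete, here is a minimal self-contained sketch (my own illustration against
the Spark 2.x Java Dataset API and Beam's WindowedValue/CoderUtils; the class
name and setup are made up, this is not code from the POC branch):

```
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.util.WindowedValue.FullWindowedValueCoder;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class BytesRoundTripSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark =
        SparkSession.builder().master("local[1]").appName("poc-sketch").getOrCreate();

    // Coder for WindowedValue<String> in the global window.
    FullWindowedValueCoder<String> coder =
        FullWindowedValueCoder.of(StringUtf8Coder.of(), GlobalWindow.Coder.INSTANCE);

    // Source side: serialize each WindowedValue so the Dataset schema is a
    // single binary column, sidestepping Spark's element-wise schema checks.
    byte[] element =
        CoderUtils.encodeToByteArray(coder, WindowedValue.valueInGlobalWindow("hello"));
    Dataset<byte[]> encoded = spark.createDataset(Arrays.asList(element), Encoders.BINARY());

    // "ParDo" side: black-box user code runs via mapPartitions, decoding the
    // WindowedValues back from bytes first.
    Dataset<String> decoded =
        encoded.mapPartitions(
            (MapPartitionsFunction<byte[], String>)
                partition -> {
                  List<String> out = new ArrayList<>();
                  while (partition.hasNext()) {
                    out.add(CoderUtils.decodeFromByteArray(coder, partition.next()).getValue());
                  }
                  return out.iterator();
                },
            Encoders.STRING());

    decoded.show();
    spark.stop();
  }
}
```

The cost discussed below is visible here: every element pays an encode/decode
round trip, which is why overriding the translation for schema-aware
transforms is attractive.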

Gleb


On Fri, Jan 18, 2019 at 3:25 PM Kenneth Knowles  wrote:

> I wonder if this could tie in with Reuven's recent work. He's basically
> making it so every type with an "obvious" schema automatically converts
> to/from Row whenever needed. Sounds like a similar need, superficially.
>
> Kenn
>
> On Fri, Jan 18, 2019, 02:36 Manu Zhang 
>> Hi Etienne,
>>
>> I see your point. I'm a bit worried that every ParDo has to be wrapped in
>> a `mapPartition` which introduces cost of serde and forgoes the benefits of
>> Dataset API.
>> Maybe Dataset is not the best idea to integrate Beam with Spark. Just my
>> $0.02.
>>
>> Manu
>>
>>
>> On Thu, Jan 17, 2019 at 10:44 PM Etienne Chauchot 
>> wrote:
>>
>>> Hi Manu,
>>> Yes a json schema can make its way to the spark source with no
>>> difficulty. but still we need to store windowedValue not only the elements
>>> that would comply with this schema. The problem is that spark will try to
>>> match the element (windowedValue) to the schema of the source at any
>>> element wise processing. (and downstream it will auto guess the schema with
>>> the content of dataset. For example if I extract timestamp in a pardo I get
>>> a Long schema in the output dataset). The problem is that windowedValue is
>>> complex and has many subclasses. Maybe bytes serialization is still the
>>> best way to go, but we don't leverage schema PCollections.
>>> Best
>>> Etienne
>>>
>>> On Thursday, 17 January 2019 at 21:52 +0800, Manu Zhang wrote:
>>>
>>> Nice Try, Etienne ! Is it possible to pass in the schema through
>>> pipeline options ?
>>>
>>> Manu
>>>
>>> On Thu, Jan 17, 2019 at 5:25 PM Etienne Chauchot 
>>> wrote:
>>>
>>> Hi Kenn,
>>>
>>> Sure, in spark DataSourceV2 providing a schema is mandatory:
>>> - if I set it to null, I obviously get a NPE
>>> - if I set it empty: I get an array out of bounds exception
>>> - if I set it to Datatype.Null, null is stored as actual elements
>>> => Consequently I set it to binary.
>>>
>>> As the beam reader is responsible for reading both the element and the
>>> timestamp, the source outputs a Dataset. So, the solution I
>>> found, for which I asked your opinion, is to serialize windowedValue to
>>> bytes using beam FullWindowedValueCoder in reader.get() and deserialize the
>>> whole dataset once the source is done using a map to get the windowedValue
>>> back and give it to the transforms downstream.
>>>
>>> I am aware that this is not optimal because of the bytes serialization
>>> roundtrip, and I wanted your suggestions around that.
>>>
>>> Thanks
>>> Etienne
>>>
>>>
>>> On Wednesday, 16 January 2019 at 19:04 -0800, Kenneth Knowles wrote:
>>>
>>> Cool!
>>>
>>> I don't quite understand the issue in "bytes serialization to comply to
>>> spark dataset schemas to store windowedValues". Can you say a little more?
>>>
>>> Kenn
>>>
>>> On Tue, Jan 15, 2019 at 8:54 AM Etienne Chauchot 
>>> wrote:
>>>
>>> Hi guys,
>>> regarding the new (made from scratch) spark runner POC based on the
>>> dataset API, I was able to make a big step forward: it can now run a first
>>> batch pipeline with a source !
>>>
>>> See
>>> https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/test/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/SourceTest.java
>>>
>>> there is no test facilities for now, testmode is enabled and it just
>>> prints the output PCollection .
>>>
>>> I made some workarounds especially String serialization to pass beam
>>> objects (was forced to) and also bytes serialization to comply to spark
>>> dataset schemas to store windowedValues.
>>>
>>> Can you give me your thoughts especially regarding these last 2 matters?
>>>
>>> The other parts are not ready for showing yet
>>>
>>> Here is the whole branch:
>>>
>>>
>>> https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming
>>>
>>> Thanks,
>>>
>>> Etienne
>>>
>>>

-- 
Cheers,
Gleb


Re: Vendoring Calcite

2019-01-14 Thread Gleb Kanterov
Great initiative. I was thinking about making a similar proposal. I tried
using Beam SQL in a project that has its own Calcite dependency, and it doesn't
work because Calcite makes an internal JDBC connection on a "jdbc:calcite:" URL,
and you can't register two drivers for the same scheme. I'm not sure how this is
going to work out with vendoring; for instance, see Frameworks.java#L153 and
MaterializedViewTable.java#L59. It probably needs to be addressed in Calcite
itself.

One of the use-cases I see for vendored Calcite is being able to rely on the
"developer" API to extend Beam SQL capabilities.

Gleb

On Mon, Jan 14, 2019 at 4:54 AM Kenneth Knowles  wrote:

> After doing the Guava vendoring as a practice run [1], I started on
> Calcite [2]. I have a couple issues, questions, suspicions I wanted to
> bring up to see if anyone had good ideas.
>
>  - Calcite has a bunch of transitive deps which I vendored with it.
>  - Calcite's transitive deps include proto. Relocating would break
> generated code unless it is also bundled. I don't really know the story
> here.
>  - Our SQL parser links into Calcite, so needs to share relocation. We do
> exactly this between vendored gRPC and the generated portability classes.
> Perhaps this parser package should be isolated from the main Beam SQL
> package.
>  - Codegen of SQL statements links into Calcite. Does it link the
> generated code with its own support library by using reflection (in which
> case relocation is probably fine) or by concatenating strings (in which
> case relocation would break things if we cannot configure a custom prefix)?
>
> It is a lot to deal with, but also a decent payoff. We know there are
> users who wanted to use their own version of Calcite but could not. Also
> you have to tweak your IntelliJ to step into Calcite's code while
> debugging, whereas after this vendoring that won't be necessary.
>
> Kenn
>
> [1] https://github.com/apache/beam/pull/7494
> [2] https://github.com/kennknowles/beam/commits/vendor-calcite
>


-- 
Cheers,
Gleb


Re: AvroIO read from unknown schema to generic record.

2019-01-14 Thread Gleb Kanterov
One approach could be creating a PTransform whose expand method wraps
AvroIO and reads the Avro writer schema from one of the files matching the
read pattern.

It will work if the set of sources with different schemas is fixed at
pipeline construction time.

```
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

public abstract class GenericAvroIORead
    extends PTransform<PBegin, PCollection<GenericRecord>> {

  public abstract String from();

  // Read the Avro writer schema from the first file matching the pattern.
  // Sketch only: filesystem registration and empty matches are not handled.
  public static Schema getSchema(String from) throws IOException {
    MatchResult.Metadata first = FileSystems.match(from).metadata().get(0);
    try (InputStream in = Channels.newInputStream(FileSystems.open(first.resourceId()))) {
      return new DataFileStream<>(in, new GenericDatumReader<GenericRecord>()).getSchema();
    }
  }

  @Override
  public PCollection<GenericRecord> expand(PBegin input) {
    try {
      return input.apply(AvroIO.readGenericRecords(getSchema(from())).from(from()));
    } catch (IOException e) {
      throw new RuntimeException("Could not read writer schema from " + from(), e);
    }
  }
}
```
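
For completeness, a hypothetical usage of the sketch above (the path and the
anonymous subclass are just for illustration; a concrete or AutoValue subclass
would be the usual way to provide from()):

```
PCollection<GenericRecord> records =
    pipeline.apply(
        new GenericAvroIORead() {
          @Override
          public String from() {
            return "gs://my-bucket/events/*.avro"; // made-up path
          }
        });
```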

On Sun, Jan 13, 2019 at 11:41 PM Reuven Lax  wrote:

> It would probably be horribly inefficient, but might work. Not only would
> you need to embed the schema in every single record, I think you would need
> to create a new GenericDatumReader on each record. You might have trouble
> getting this to scale.
>
> BTW you might need to create a new encoding format for this. The problem
> is that GenericDatumReader (which we use to read Avro records) requires the
> schema ahead of time, and in your model we won't have the schema until a
> GenericRecord is produced (also I don't believe the encoded Avro record
> will include the schema); this produces a chicken/egg problem. You would
> probably need to create a new encoding format that allowed you to store the
> schema along with the record. At that point this isn't AvroCoder, it's a
> coder for the new type you've created.
>
> Reuven
>
> On Sun, Jan 13, 2019 at 1:28 PM Niel Markwick  wrote:
>
>> This is my point though : AvroIO.parseAllGenericRecords() is able to
>> decode the object from the avro file into a GenericRecord  _without_
>> knowing the schema in advance as it uses the writer schema embedded in the
>> file.
>>
>> So can there be a GenericRecordAvroCoder which uses the schema embedded
>> in the GenericRecord to encode itself?
>>
>> On Sun, 13 Jan 2019, 16:59 Reuven Lax,  wrote:
>>
>>> AvroCoder needs to know the schema of the object in order to decode the
>>> object. Remember that in Beam the set of PCollections in a graph is static,
>>> so all the coders need to be known up front. To make things work with
>>> parseAllGenericRecords I think you would either need to embed the schema in
>>> every single record (which would be expensive), or you would need to create
>>> a new union type to represent the possible types (assuming that you know
>>> the possible schemas ahead of time).
>>>
>>> On Sat, Jan 12, 2019 at 12:09 PM Niel Markwick  wrote:
>>>
 Considering the transform is reading Avro container files, which by
 definition contain a schema, it should be possible for the reader to infer
 the schema from the file...

 parseAllGenericRecords() seems to be able to do this: it decodes and passes
 a GenericRecord to the given parseFn without needing to know the schema in
 advance...

 In fact parseAllGenericRecords() would be perfect for my requirements
 if I could use a Contextful.Fn as a parseFn that accepted side inputs :/



 
 • Niel Markwick
 • Cloud Solutions Architect
 • Google Belgium
 • ni...@google.com
 • +32 2 894 6771

 Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie.
 RPR: 0878.065.378

 If you received this communication by mistake, please don't forward it
 to anyone else (it may contain confidential or privileged information),
 please erase all copies of it, including all attachments, and please let
 the sender know it went to the wrong person. Thanks


 On Sat, 12 Jan 2019 at 20:08, Alex Van Boxel  wrote:

> Hey Niels,
>
> The reason you need to specify the schema to GenericRecord is that
> without it it's *impossible* for GenericRecord to make any sense of
> the binary data. Unlike protobuf, avro doesn't have any kind of 
> information
> in the message about the structure. This makes it smaller, but impossible
> to decode without the schema.
>
> So if you really want to do flexible messages, I would read it as binary,
> message per message, and handle your schema switching in a DoFn.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Sat, Jan 12, 2019 at 7:44 PM Niel Markwick 
> wrote:
>
>> Is there a reason why don't we have an AvroIO reader that reads and
>> outputs a GenericRecord 

Re: [Go SDK] User Defined Coders

2019-01-03 Thread Gleb Kanterov
Reuven, that sounds great. I see there is something similar to Row coders
happening in Apache Arrow, and there is a similarity between Apache Arrow
Flight and the data exchange service in portability. How do you see these
two things relating to each other in the long term?

On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:

> The biggest advantage is actually readability and usability. A secondary
> advantage is that it means that Go will be able to interact seamlessly with
> BeamSQL, which would be a big win for Go.
>
> A schema is basically a way of saying that a record has a specific set of
> (possibly nested, possibly repeated) fields. So for instance let's say that
> the user's type is a struct with fields named user, country, purchaseCost.
> This allows us to provide transforms that operate on field names. Some
> example (using the Java API):
>
> PCollection users = events.apply(Select.fields("user"));  // Select out
> only the user field.
>
> PCollection joinedEvents =
> queries.apply(Join.innerJoin(clicks).byFields("user"));  // Join two
> PCollections by user.
>
> // For each country, calculate the total purchase cost as well as the top
> 10 purchases.
> // A new schema is created containing fields total_cost and top_purchases,
> and rows are created with the aggregation results.
> PCollection purchaseStatistics = events.apply(
> Group.byFieldNames("country")
>.aggregateField("purchaseCost", Sum.ofLongs(),
> "total_cost"))
> .aggregateField("purchaseCost", Top.largestLongs(10),
> "top_purchases"))
>
>
> This is far more readable than what we have today, and what unlocks this
> is that Beam actually knows the structure of the record instead of assuming
> records are uncrackable blobs.
>
> Note that a coder is basically a special case of a schema that has a
> single field.
>
> In BeamJava we have a SchemaRegistry which knows how to turn user types
> into schemas. We use reflection to analyze many user types (e.g. simple
> POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
> determine the schema, however this is done only when the graph is initially
> generated. We do use code generation (in Java we do bytecode generation) to
> make this somewhat more efficient. I'm willing to bet that the code
> generator you've written for structs could be very easily modified for
> schemas instead, so it would not be wasted work if we went with schemas.
>
> One of the things I'm working on now is documenting Beam schemas. They are
> already very powerful and useful, but since there is still nothing in our
> documentation about them, they are not yet widely used. I expect to finish
> draft documentation by the end of January.
>
> Reuven
>
> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke  wrote:
>
>> That's an interesting idea. I must confess I don't rightly know the
>> difference between a schema and coder, but here's what I've got with a bit
>> of searching through memory and the mailing list. Please let me know if I'm
>> off track.
>>
>> As near as I can tell, a schema, as far as Beam takes it
>> 
>>  is
>> a mechanism to define what data is extracted from a given row of data. So
>> in principle, there's an opportunity to be more efficient with data with
>> many columns that aren't being used, and only extract the data that's
>> meaningful to the pipeline.
>> The trick then is how to apply the schema to a given serialization
>> format, which is something I'm missing in my mental model (and then how to
>> do it efficiently in Go).
>>
>> I do know that the Go client package for BigQuery
>>  does
>> something like that, using field tags. Similarly, the "encoding/json"
>>  package in the Go
>> Standard Library permits annotating fields and it will read out and
>> deserialize the JSON fields and that's it.
>>
>> A concern I have is that Go (at present) would require pre-compile time
>> code generation for schemas to be efficient, and they would still mostly
>> boil down to turning []bytes into real structs. Go reflection doesn't keep
>> up.
>> Go has no mechanism I'm aware of to Just In Time compile more efficient
>> processing of values.
>> It's also not 100% clear how Schema's would play with protocol buffers or
>> similar.
>> BigQuery has a mechanism of generating a JSON schema from a proto file
>> , but
>> that's only the specification half, not the using half.
>>
>> As it stands, the code generator I've been building these last months
>> could (in principle) statically analyze a user's struct, and then generate
>> an efficient dedicated coder for 

Re: Adding ":beam-runners-direct-java:needsRunnerTests" to "Run Java PreCommit"

2018-12-28 Thread Gleb Kanterov
I looked into the failing tests; we can put them into 5 categories:
- CastTest: invalid test case, fixed in apache/beam#7372
<https://github.com/apache/beam/pull/7372>
- UsesMetricsPusher tests running with the direct runner: attempted fix in
apache/beam#7374 <https://github.com/apache/beam/pull/7374/commits>
- hanging tests using the test
stream: BoundedReadFromUnboundedSourceTest#testTimeBound
and SplittableDoFnTest#testLateData, need more investigation
- tests using the Watch transform: WatchTest, FileIOTest, AvroIOTest. The Watch
transform needs refactoring after apache/beam#6467
<https://github.com/apache/beam/pull/6467>
- TFRecordIOTest: needs more investigation

Not sure what would be the best way to proceed while tests are failing.

On Fri, Dec 28, 2018 at 5:30 PM Reuven Lax  wrote:

> I think they run in postcommit. Seems like this was an oversight maybe.
>
> On Fri, Dec 28, 2018, 5:21 PM Ismaël Mejía 
>> Big +1
>>
>> I am in awe that they are not active anymore, they used to be in the
>> good old maven days, when IDE integration worked and most JIRAs were
>> not about build or test issues. Sigh.
>>
>> On Fri, Dec 28, 2018 at 3:24 PM Reuven Lax  wrote:
>> >
>> > Kenn and I both noticed that some needsRunner tests time out, and we
>> were both wondering why our PreCommit was still green. These tests are meant
>> to be quick, and IMO should definitely be part of Java PreCommit.
>> >
>> > On Fri, Dec 28, 2018 at 2:59 PM Gleb Kanterov  wrote:
>> >>
>> >> After reading Beam Testing I had an impression that NeedsRunner tests
>> are executed as a part of Java PreCommit using Direct runner. However, it
>> doesn't seem to be the case. I've tried running these tests locally, and
>> few of them are failing or timeout.
>> >>
>> >> My proposal is adding ":beam-runners-direct-java:needsRunnerTests" to
>> "Run Java PreCommit" Jenkins job, or running them in parallel as a part of
>> a new task, not to increase build time. Running needsRunnerTests on my
>> laptop took 3 minutes, so I just lean towards adding them to javaPreCommit.
>> Thoughts?
>> >>
>> >> > Task :beam-runners-direct-java:needsRunnerTests
>> >>
>> >> org.apache.beam.runners.core.metrics.MetricsPusherTest > test FAILED
>> >> java.lang.AssertionError at MetricsPusherTest.java:71
>> >>
>> >> org.apache.beam.sdk.io.TFRecordIOTest > testReadInvalidDataMask FAILED
>> >> java.lang.AssertionError
>> >>
>> >> org.apache.beam.sdk.io.TFRecordIOTest > testReadInvalidLengthMask
>> FAILED
>> >> java.lang.AssertionError
>> >>
>> >> org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
>> testContinuouslyWriteAndReadMultipleFilepatterns[0: true] FAILED
>> >> java.lang.IllegalArgumentException at AvroIOTest.java:557
>> >>
>> >> org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
>> testWriteThenReadJavaClass[0: true] FAILED
>> >> java.lang.AssertionError at AvroIOTest.java:306
>> >>
>> >> org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
>> testContinuouslyWriteAndReadMultipleFilepatterns[1: false] FAILED
>> >> java.lang.IllegalArgumentException at AvroIOTest.java:557
>> >>
>> >> org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
>> testWriteThenReadJavaClass[1: false] FAILED
>> >> java.lang.AssertionError at AvroIOTest.java:306
>> >>
>> >> org.apache.beam.sdk.io.FileIOTest > testMatchWatchForNewFiles FAILED
>> >> java.lang.IllegalArgumentException at FileIOTest.java:192
>> >>
>> >> org.apache.beam.sdk.io.TextIOReadTest$BasicIOTest >
>> testReadWatchForNewFiles FAILED
>> >> java.lang.IllegalArgumentException at TextIOReadTest.java:846
>> >>
>> >> org.apache.beam.sdk.schemas.transforms.CastTest > testTypeNarrowFail
>> FAILED
>> >> java.lang.AssertionError
>> >>
>> >> org.apache.beam.sdk.schemas.transforms.CastTest >
>> testWeakedNullableFail FAILED
>> >> java.lang.AssertionError
>> >>
>> >> org.apache.beam.sdk.transforms.SplittableDoFnTest > testBoundedness
>> FAILED
>> >>
>>  org.apache.beam.sdk.testing.TestPipeline$PipelineRunMissingException
>> >>
>> >> org.apache.beam.sdk.transforms.SplittableDoFnTest > testLateData FAILED
>> >> org.junit.runners.model.TestTimedOutException at
>> SplittableDoFnTest.java:683

Adding ":beam-runners-direct-java:needsRunnerTests" to "Run Java PreCommit"

2018-12-28 Thread Gleb Kanterov
After reading Beam Testing, I
had the impression that NeedsRunner tests are executed as part of the Java
PreCommit using the Direct runner. However, that doesn't seem to be the case.
I've tried running these tests locally, and a few of them fail or
time out.

My proposal is adding ":beam-runners-direct-java:needsRunnerTests" to the "Run
Java PreCommit" Jenkins job, or running them in parallel as part of a new
task so as not to increase build time. Running needsRunnerTests on my laptop
took 3 minutes, so I just lean towards adding them to javaPreCommit.
Thoughts?
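
For context, this is roughly what such a test looks like (a hypothetical
example following the convention in the codebase, not one of the failing tests
below):

```
import org.apache.beam.sdk.testing.NeedsRunner;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class ExampleNeedsRunnerTest {
  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  // Tests in this category are excluded from plain unit test runs and only
  // executed by tasks that provide a runner, such as needsRunnerTests.
  @Test
  @Category(NeedsRunner.class)
  public void testCreate() {
    PCollection<String> output = pipeline.apply(Create.of("a", "b"));
    PAssert.that(output).containsInAnyOrder("a", "b");
    pipeline.run().waitUntilFinish();
  }
}
```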

> Task :beam-runners-direct-java:needsRunnerTests

org.apache.beam.runners.core.metrics.MetricsPusherTest > test FAILED
java.lang.AssertionError at MetricsPusherTest.java:71

org.apache.beam.sdk.io.TFRecordIOTest > testReadInvalidDataMask FAILED
java.lang.AssertionError

org.apache.beam.sdk.io.TFRecordIOTest > testReadInvalidLengthMask FAILED
java.lang.AssertionError

org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
testContinuouslyWriteAndReadMultipleFilepatterns[0: true] FAILED
java.lang.IllegalArgumentException at AvroIOTest.java:557

org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
testWriteThenReadJavaClass[0: true] FAILED
java.lang.AssertionError at AvroIOTest.java:306

org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
testContinuouslyWriteAndReadMultipleFilepatterns[1: false] FAILED
java.lang.IllegalArgumentException at AvroIOTest.java:557

org.apache.beam.sdk.io.AvroIOTest$NeedsRunnerTests >
testWriteThenReadJavaClass[1: false] FAILED
java.lang.AssertionError at AvroIOTest.java:306

org.apache.beam.sdk.io.FileIOTest > testMatchWatchForNewFiles FAILED
java.lang.IllegalArgumentException at FileIOTest.java:192

org.apache.beam.sdk.io.TextIOReadTest$BasicIOTest >
testReadWatchForNewFiles FAILED
java.lang.IllegalArgumentException at TextIOReadTest.java:846

org.apache.beam.sdk.schemas.transforms.CastTest > testTypeNarrowFail FAILED
java.lang.AssertionError

org.apache.beam.sdk.schemas.transforms.CastTest > testWeakedNullableFail
FAILED
java.lang.AssertionError

org.apache.beam.sdk.transforms.SplittableDoFnTest > testBoundedness FAILED
org.apache.beam.sdk.testing.TestPipeline$PipelineRunMissingException

org.apache.beam.sdk.transforms.SplittableDoFnTest > testLateData FAILED
org.junit.runners.model.TestTimedOutException at
SplittableDoFnTest.java:683

org.apache.beam.sdk.transforms.WatchTest >
testSinglePollMultipleInputsWithSideInput FAILED
java.lang.IllegalArgumentException at WatchTest.java:112

org.apache.beam.sdk.transforms.WatchTest >
testMultiplePollsWithKeyExtractor FAILED
java.lang.IllegalArgumentException at WatchTest.java:203

org.apache.beam.sdk.transforms.WatchTest > testSinglePollMultipleInputs
FAILED
java.lang.IllegalArgumentException at WatchTest.java:82

org.apache.beam.sdk.transforms.WatchTest >
testMultiplePollsWithTerminationDueToTerminationCondition FAILED
java.lang.IllegalArgumentException at WatchTest.java:150

org.apache.beam.sdk.transforms.WatchTest > testMultiplePollsWithManyResults
FAILED
java.lang.IllegalArgumentException at WatchTest.java:308

org.apache.beam.sdk.transforms.WatchTest > testSinglePollWithManyResults
FAILED
java.lang.IllegalArgumentException at WatchTest.java:258

org.apache.beam.sdk.transforms.WatchTest >
testMultiplePollsStopAfterTimeSinceNewOutput FAILED
java.lang.IllegalArgumentException at WatchTest.java:232

org.apache.beam.sdk.transforms.WatchTest >
testMultiplePollsWithTerminationBecauseOutputIsFinal FAILED
java.lang.IllegalArgumentException at WatchTest.java:150

647 tests completed, 21 failed, 4 skipped


Re: Beam snapshots broken

2018-12-27 Thread Gleb Kanterov
I can reproduce this on my machine, and reverting
https://github.com/apache/beam/pull/7324 fixed the problem. There is a
separate thread on dev@ about releasing vendored gRPC v0.2; I'm wondering
if it will fix this issue.

On Thu, Dec 27, 2018 at 5:20 PM Ismaël Mejía  wrote:

> Looks like snapshots are broken again since 20 december, can somebody PTAL?
> Seems like some part of the vendoring could be related to this failure
> (maybe it is looking for the unpublished version)?
>
> Running some tests in one existing application I found this
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time
> elapsed: 0.447 s <<< FAILURE! - in SerializationTest
> [ERROR] nonSerilizableTest(SerializationTest)  Time elapsed: 0.028 s  <<<
> ERROR!
> java.lang.NoClassDefFoundError:
> org/apache/beam/vendor/grpc/v1p13p1/com/google/protobuf/ProtocolMessageEnum
> at SerializationTest.nonSerilizableTest(SerializationTest.java:27)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.beam.vendor.grpc.v1p13p1.com.google.protobuf.ProtocolMessageEnum
> at SerializationTest.nonSerilizableTest(SerializationTest.java:27)
>
> On Thu, Dec 13, 2018 at 9:13 AM Mark Liu  wrote:
> >
> > Looks like the recent failure (like this job) is related to
> ':beam-sdks-python:test' change introduced in this PR. `./gradlew
> :beam-sdks-python:test` can reproduce the error.
> >
> > Testing a fix in PR7273.
> >
> > On Wed, Dec 12, 2018 at 8:31 AM Yifan Zou  wrote:
> >>
> >> Beam9 is offline right now. But, the job also failed on beam4 and 13
> with "Could not determine the dependencies of task
> ':beam-sdks-python:test.".
> >> Seems like the task dependency did not setup properly.
> >>
> >>
> >>
> >> On Wed, Dec 12, 2018 at 2:03 AM Ismaël Mejía  wrote:
> >>>
> >>> You are right it seems that it was related to beam9 (wondering if it
> >>> was bad luck that it was always assigned to beam9 or we can improve
> >>> that poor balancing error).
> >>> However it failed again today against beam13 maybe this time is just a
> >>> build issue but seems related to python too.
> >>>
> >>> On Tue, Dec 11, 2018 at 7:33 PM Boyuan Zhang 
> wrote:
> >>> >
> >>> > Seems like the failed jobs are not due to a single task
failure. The failed tasks were executed on beam9, which was rebooted
> yesterday because python tests failed continuously. +Yifan Zou may have
> more useful content here.
> >>> >
> >>> > On Tue, Dec 11, 2018 at 9:10 AM Ismaël Mejía 
> wrote:
> >>> >>
> >>> >> It seems that Beam snapshots are broken since Dec. 2
> >>> >>
> https://builds.apache.org/view/A-D/view/Beam/job/beam_Release_Gradle_NightlySnapshot/
> >>> >>
> >>> >> It seems "The :beam-website:startDockerContainer task failed."
> >>> >> Can somebody please take a look.
>


-- 
Cheers,
Gleb


Re: [ANNOUNCEMENT] [SQL] [BEAM-6133] Support for user defined table functions (UDTF)

2018-12-14 Thread Gleb Kanterov
Kenn, I don't have a copy of a recent SQL standard at hand to confirm what I'm
saying. To my knowledge, initially there was a concept of a table
function; table functions have a static type that doesn't depend on the
supplied arguments. ANSI SQL 2016 added a concept of polymorphic
table functions, which can infer types depending on the provided arguments. Both
TableFunction and TableMacro in Calcite are polymorphic table functions,
and the difference between TableFunction and TableMacro is internal to
Calcite.

Gleb



On Fri, Dec 14, 2018 at 4:26 AM Kenneth Knowles  wrote:

> Sorry for the slow reply & review. Having UDTF support in Beam SQL is
> extremely useful. Are both table functions and table macros part of
> "standard" SQL or is this a distinction between different Calcite concepts?
>
> Kenn
>
> On Wed, Nov 28, 2018 at 10:36 AM Gleb Kanterov  wrote:
>
>> At the moment we support only ScalarFunction UDF, it's functions that
>> operate on row fields. In Calcite, there are 3 kinds of UDFs: aggregate
>> functions (that we already support), table macro and table functions. The
>> difference between table functions and macros is that macros expand to
>> relations, and table functions can refer to anything queryable, e.g.,
>> enumerables. But in the case of Beam SQL, given everything translates to
>> PTransforms, only table macros are relevant.
>>
>> UDTF are in a way similar to external tables but don't require to specify
>> a schema explicitly. Instead, they can derive schema based on arguments.
>> One of the use-cases would be querying ranges of dataset partitions using a
>> helper function like:
>>
>> SELECT COUNT(*) FROM table(readAvro(id => 'dataset', start =>
>> '2017-01-01', end => '2018-01-01'))
>>
>> With BEAM-6133 <https://issues.apache.org/jira/browse/BEAM-6133> (
>> apache/beam/#7141 <https://github.com/apache/beam/pull/7141>) we would
>> have support for UDTF in Beam SQL.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6133
>> [2] https://github.com/apache/beam/pull/7141
>>
>> Gleb
>>
>

-- 
Cheers,
Gleb


[ANNOUNCEMENT] [SQL] [BEAM-6133] Support for user defined table functions (UDTF)

2018-11-28 Thread Gleb Kanterov
At the moment we support only ScalarFunction UDFs, that is, functions that
operate on row fields. In Calcite, there are 3 kinds of UDFs: aggregate
functions (which we already support), table macros, and table functions. The
difference between table functions and macros is that macros expand to
relations, while table functions can refer to anything queryable, e.g.,
enumerables. But in the case of Beam SQL, given that everything translates to
PTransforms, only table macros are relevant.

UDTFs are in a way similar to external tables but don't require specifying a
schema explicitly. Instead, they can derive the schema based on arguments. One
of the use-cases would be querying ranges of dataset partitions using a
helper function like:

SELECT COUNT(*) FROM table(readAvro(id => 'dataset', start => '2017-01-01',
end => '2018-01-01'))

With BEAM-6133 (apache/beam#7141) we would have support for UDTFs in Beam SQL.

[1] https://issues.apache.org/jira/browse/BEAM-6133
[2] https://github.com/apache/beam/pull/7141

Gleb


Re: Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
Rui, I'm not completely sure I understand why it isn't possible to find a
suitable encoding for portability. As I understand it, the only requirement is
a deterministic encoding consistent with equality, so the existing
representation of BYTES will work (a VarInt length followed by the bytes). In
my understanding, it's up to the SDK to decide what the best representation of
byte[] is in each language. In our case, for Java, we fix equality by wrapping
byte[] in an object.
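
A minimal sketch of such an encoding (my own illustration of the idea, not
Beam's actual ByteArrayCoder): a VarInt length prefix followed by the raw
bytes, so equal arrays always encode to equal byte strings:

```
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class VarIntBytesEncoding {
  // Write an unsigned VarInt: 7 bits per byte, high bit marks continuation.
  static void writeVarInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  // Deterministic encoding consistent with structural equality:
  // length prefix, then the bytes themselves.
  static byte[] encode(byte[] bytes) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVarInt(out, bytes.length);
    out.write(bytes, 0, bytes.length);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] a = encode(new byte[] {1, 2, 3});
    byte[] b = encode(new byte[] {1, 2, 3});
    System.out.println(Arrays.equals(a, b)); // true: deterministic
  }
}
```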

On Mon, Oct 29, 2018 at 6:42 PM Lukasz Cwik  wrote:

> I believe Kenn is spot on. The focus of the issue is too narrow as your
> talking about the short term problem related to Map.
> Schemas are very similar to coders and coders have been solving this
> problem by delegating to the underlying component coder to figure out
> whether two things are equal. You could make all the coders used within
> Schema's deterministic.
> For example, what prevents you from using a MapCoder that has been made to
> be deterministic by sorting the encoded keys lexicographically?
>
> On Mon, Oct 29, 2018 at 10:27 AM Rui Wang  wrote:
>
>> Seems to me that only Map<byte[], ?>'s equality check cannot be solved by
>> deepEquals, because keys cannot be looked up correctly in the Map.
>> If we cannot have a useful use case for Map<byte[], ?>, we could reject
>> it in Schema and still keep byte[].
>>
>> Option 3 needs to find a wrapper for byte[] that is
>> language-independent & encoding-independent for portability, which seems a
>> hard (and maybe impossible?) search.
>>
>> -Rui
>>
>> On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov  wrote:
>>
>>> There is an indirect connection to RowCoder because `MapCoder` isn't
>>> deterministic, therefore, this doesn't hold:
>>>
>>> >  - also each type (hence Row type) should have portable encoding(s)
>>> that respect this equality so shuffling is consistent
>>>
>>> I think it's a requirement only for rows we want to shuffle by.
>>>
>>> > About these specific use cases, how useful is it to support
>>> Map<byte[], ?> and List<byte[]>?
>>>
>>> Not sure about Map, but in BigQuery it's possible to define an `ARRAY<BYTES>`
>>> type. It can group by BYTES, but not by ARRAYs.
>>>
>>>
>>> On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:
>>>
>>>> About these specific use cases, how useful is it to support Map<byte[], ?> and List<byte[]>? These seem pretty exotic (maybe they aren't) and I
>>>> wonder whether it would make sense to just reject them until we have a
>>>> solid design.
>>>>
>>>> And wouldn't the same problems arise even without RowCoder? Is the path
>>>> in that case to implement a custom coder?
>>>>
>>>> Regards,
>>>> Anton
>>>>
>>>>
>>>> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles 
>>>> wrote:
>>>>
>>>>> I'll summarize my input to the discussion. It is rather high level.
>>>>> But IMO:
>>>>>
>>>>>  - even though schemas are part of Beam Java today, I think they
>>>>> should become part of portability when ready
>>>>>  - so each type in a schema needs a language-independent &
>>>>> encoding-independent notion of domain of values and equality (so obviously
>>>>> equal bytes are equal)
>>>>>  - embedding in any language (hence Row in Java) must have a schema
>>>>> type-driven equality that matches this spec
>>>>>  - also each type (hence Row type) should have portable encoding(s)
>>>>> that respect this equality so shuffling is consistent
>>>>>  - Row in Java should be able to decode these encodings to different
>>>>> underlying representations and change its strategy over time
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov 
>>>>> wrote:
>>>>>
>>>>>> With adding BYTES type, we broke equality.
>>>>>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>>>>>> hold for exotic types such as `Map`, `List`. The root
>>>>>> cause is `byte[]`, where `equals` is implemented as reference equality
>>>>>> instead of structural.
>>>>>>
>>>>>> Before we jump into solution mode, let's state what we want to have:
>>>>>> - *API* have stable API and be able to evolve efficient and use-case
>>>>>> specific implementations without breaking it
>>>>>>

Re: Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
There is an indirect connection to RowCoder because `MapCoder` isn't
deterministic; therefore, this doesn't hold:

>  - also each type (hence Row type) should have portable encoding(s) that
respect this equality so shuffling is consistent

I think it's a requirement only for rows we want to shuffle by.

> About these specific use cases, how useful is it to support Map<byte[], ?> and List<byte[]>?

Not sure about Map, but in BigQuery it's possible to define an `ARRAY<BYTES>`
type. It can group by BYTES, but not by ARRAYs.


On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:

> About these specific use cases, how useful is it to support Map<byte[], ?>
> and List<byte[]>? These seem pretty exotic (maybe they aren't) and I wonder
> whether it would make sense to just reject them until we have a solid
> design.
>
> And wouldn't the same problems arise even without RowCoder? Is the path in
> that case to implement a custom coder?
>
> Regards,
> Anton
>
>
> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles  wrote:
>
>> I'll summarize my input to the discussion. It is rather high level. But
>> IMO:
>>
>>  - even though schemas are part of Beam Java today, I think they should
>> become part of portability when ready
>>  - so each type in a schema needs a language-independent &
>> encoding-independent notion of domain of values and equality (so obviously
>> equal bytes are equal)
>>  - embedding in any language (hence Row in Java) must have a schema
>> type-driven equality that matches this spec
>>  - also each type (hence Row type) should have portable encoding(s) that
>> respect this equality so shuffling is consistent
>>  - Row in Java should be able to decode these encodings to different
>> underlying representations and change its strategy over time
>>
>> Kenn
>>
>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:
>>
>>> With adding BYTES type, we broke equality.
>>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>>> hold for exotic types such as `Map`, `List`. The root
>>> cause is `byte[]`, where `equals` is implemented as reference equality
>>> instead of structural.
>>>
>>> Before we jump into solution mode, let's state what we want to have:
>>> - *API* have stable API and be able to evolve efficient and use-case
>>> specific implementations without breaking it
>>> - *Correctness *we can't trade off correctness, a trivial
>>> implementation should work
>>> - *Performance *comparing equality is a fundamental operation, and we
>>> want to make it cheap
>>>
>>> 1. set `consistentWithEquals` if there is BYTES field
>>> Pros: almost no pros
>>> Cons: It would introduce a significant number of allocations when
>>> comparing rows, so we reject this option.
>>>
>>> 2. implement custom deep equals in `Row#equals`
>>> Pros: good performance, doesn't change API, `Row#equals` is correct
>>> Cons: doesn't work for `Map`, unless we roll own
>>> implementation
>>> Cons: it's possible to obtain `List` from `getValue()` that has
>>> broken equality, contains, etc, unless we roll own implementation
>>> Cons: non-trivial and requires ~200LOC to implement
>>>
>>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>>> StructuralByteArray)
>>> Pros: good performance and flexible to change how Java wrapper is
>>> implemented
>>> Pros: simple, doesn't require any specialized collections, no surprises,
>>> `Map` and `List` work.
>>> Cons: will change the return type of `Row#getValue`
>>>
>>> I want to suggest going with option #3. However, it isn't completely
>>> clear what wrapper we want to use, either it could be StructuralByteArray,
>>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
>>> additional integer fields. StructuralByteArray doesn't have anything not
>>> necessary. One option would be adding `Row#getByteBuffer` that would be
>>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
>>> override it for better performance, but `getValue(i)` must return
>>> StructuralByteArray.
>>>
>>> References:
>>> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
>>> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
>>> https://github.com/apache/beam/pull/6765
>>>
>>> Gleb
>>>
>>

-- 
Cheers,
Gleb


Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
With the addition of the BYTES type, we broke equality.
`RowCoder#consistentWithEquals` is always true, but this property doesn't hold
for exotic types such as `Map<byte[], ?>` and `List<byte[]>`. The root cause is
`byte[]`, where `equals` is implemented as reference equality instead of
structural equality.

Before we jump into solution mode, let's state what we want to have:
- *API*: have a stable API and be able to evolve efficient and use-case
specific implementations without breaking it
- *Correctness*: we can't trade off correctness; a trivial implementation
should work
- *Performance*: comparing equality is a fundamental operation, and we want to
make it cheap

1. return false from `consistentWithEquals` when there is a BYTES field
Pros: almost no pros
Cons: it would introduce a significant number of allocations when comparing
rows, so we reject this option.

2. implement custom deep equals in `Row#equals`
Pros: good performance, doesn't change the API, `Row#equals` is correct
Cons: doesn't work for `Map<byte[], ?>` unless we roll our own implementation
Cons: it's possible to obtain a `List<byte[]>` from `getValue()` that has
broken equality, contains, etc., unless we roll our own implementation
Cons: non-trivial and requires ~200 LOC to implement

3. wrap byte[] in a Java object with fixed equality (e.g., StructuralByteArray)
Pros: good performance, and flexibility to change how the Java wrapper is
implemented
Pros: simple, doesn't require any specialized collections, no surprises,
`Map<byte[], ?>` and `List<byte[]>` work.
Cons: will change the return type of `Row#getValue`

I want to suggest going with option #3. However, it isn't completely clear
which wrapper we want to use: it could be either StructuralByteArray or
ByteBuffer. ByteBuffer is more standard; however, it comes with 4 additional
integer fields, while StructuralByteArray doesn't carry anything unnecessary.
One option would be adding a `Row#getByteBuffer` that would be
`ByteBuffer.wrap(getValue(i).getValues())`; a specialized implementation could
override it for better performance, but `getValue(i)` must return
StructuralByteArray.
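
For illustration, here is a minimal self-contained sketch of both the problem
and option #3 (the `StructuralBytes` wrapper below is a hypothetical stand-in
in the spirit of StructuralByteArray, not Beam's actual class):

```
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ByteArrayEqualityDemo {
  // Hypothetical wrapper in the spirit of option #3: equality and hashCode
  // are structural, based on the array contents.
  static final class StructuralBytes {
    private final byte[] value;
    StructuralBytes(byte[] value) { this.value = value; }
    @Override public boolean equals(Object o) {
      return o instanceof StructuralBytes && Arrays.equals(value, ((StructuralBytes) o).value);
    }
    @Override public int hashCode() { return Arrays.hashCode(value); }
  }

  public static void main(String[] args) {
    // byte[] keys use reference equality and identity hash codes, so a
    // lookup with an equal-but-distinct array fails.
    Map<byte[], String> byReference = new HashMap<>();
    byReference.put("key".getBytes(), "v");
    System.out.println(byReference.get("key".getBytes())); // null

    // With a structural wrapper the same lookup succeeds.
    Map<StructuralBytes, String> structural = new HashMap<>();
    structural.put(new StructuralBytes("key".getBytes()), "v");
    System.out.println(structural.get(new StructuralBytes("key".getBytes()))); // v
  }
}
```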

References:
- [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
- [BEAM-5646] Fix quality and hashcode for bytes in Row,
https://github.com/apache/beam/pull/6765

Gleb


Re: Growing Beam -- A call for ideas? What is missing? What would be good to see?

2018-10-29 Thread Gleb Kanterov
I'm a scio contributor, and I have a lot of experience with Scala. However,
I would advise NOT using Scala. There are several problems with
maintaining Scala libraries:

- you have to build different artifacts for each Scala version
- artifacts have dependencies on the Scala standard library
- it becomes an even bigger problem with the Scala 3 migration
- Scala is a very complex language and requires a lot of discipline

Because of these issues, you can see how much time it takes to upgrade Spark
to a newer Scala version, or to upgrade libraries from the Twitter ecosystem.

Gleb

On Sun, Oct 28, 2018 at 10:10 PM Kenneth Knowles  wrote:

> Porting to Scio is not necessary. I expect you can use Scala main() + Java
> SDK + Scala DPASF no problem. Then Scio users can use it also, of course,
> and so can Java SDK users.
>
> Doesn't Scala compile to jars that are somewhat usable from Java? I've
> only ever gone the other way, but I thought it was somewhat both ways. That
> would mean Java main() + Java SDK + Scala DPASF is also viable. But there's
> still the matter of supporting Scala in our build system & codebase.
> Personally SGTM since I know Scala well but I wouldn't want to have code
> that only a couple people are comfortable modifying*. From a whole-project
> perspective I would yield to the broader Beam dev community on this issue.
>
> I took a quick look at your Scala and it does look like it would mostly
> port to Java quite easily, just a bit more boilerplate. Do you know of
> particular things that might be hard?
>
> Kenn
>
> *our Jenkins and Gradle groovy scripts are further from Java than Scala
> and we seem to be doing OK-but-not-great as far as everyone feeling OK to
> modify them
>
>
> On Sat, Oct 27, 2018 at 1:43 AM David Morávek 
> wrote:
>
>> Hello Alejandro,
>>
>> +1 for java implementation, even though this would probably require more
>> effort from your side
>>
>> The main problem with Scio is that it lives outside beam code base and
>> depends on specific version of Beam SDK. The sketching extension (and any
>> other module in beam code base) on the other hand uses Beam SDK that is
>> build from sources (current checkout), that Scio might not be compatible
>> with.
>>
>> D.
>>
>> On Sat, Oct 27, 2018 at 8:26 AM Alejandro  wrote:
>>
>>> Hello,
>>>
>>> although not exactly your intentions, I am also looking to contribute to
>>> Beam, but from a code perspective.
>>>
>>> I've been discussing with some beam members like Austin and lukasz
>>> (CCed) on how to integrate https://github.com/elbaulp/DPASF into Beam.
>>>
>>> It seems the best place for this algorithms is
>>> https://github.com/apache/beam/tree/master/sdks/java/extensions/sketching
>>> ,
>>> but right now I lack the Beam knowledge that allow my to implement it.
>>> So I am looking to someone who could help me start. Should I write
>>> wrappers that interface my Scala code using
>>> https://github.com/spotify/scio? or re implement all in Java?
>>>
>>> Cheers.
>>>
>>> On 10/26/2018 11:55 PM, Rose Nguyen wrote:
>>> > I've heard of many people referring to the Medium posts related to Beam
>>> > for step-by-step tutorials.
>>> >
>>> > https://medium.com/tag/apache-beam/latest
>>> >
>>> > On Thu, Oct 25, 2018 at 9:25 PM Austin Bennett
>>> > mailto:whatwouldausti...@gmail.com>>
>>> wrote:
>>> >
>>> > Hi Beam Devs and Users,
>>> >
>>> > Trying to get a sense from the community on the sorts of things we
>>> > think would be useful to build the community (I am thinking not
>>> from
>>> > an angle of specific code/implementation/functionality, but from a
>>> > user/usability -- I want to dive in and make real contributions
>>> with
>>> > the code, too, but know I also have the interest and skills to help
>>> > with education and community aspects, hence my focus on this).
>>> >
>>> > I had previously suggested a sort of cookbook for focused and
>>> > curated examples (code and explination) to help people get started,
>>> > on-boarding, using Beam to aid getting up and running and
>>> > accomplishing something worthwhile (and quickly), that seems one
>>> way
>>> > to help grow our user base (and maybe future dev base afterwards
>>> > those users become enamored), which did get some positive feedback
>>> > when first put out there.
>>> >
>>> > There are many other areas where featuring others sharing successes
>>> > from having used Beam or little tips can be valuable, Pablo's
>>> > Awesome Beam is one example of such a
>>> > collection: https://github.com/pabloem/awesome-beam or even
>>> > centralizing a general place to find any/all Beam
>>> > blogs/shared-code/writeups/etc.
>>> >
>>> > Certainly there is a place for all sorts of contributions and
>>> > resources.  What do people on these lists think would be
>>> > particularly useful?  Trying to get a more focused sense of where
>>> we
>>> > think efforts might be best focused.
>>> >
>>> > Please share anything (even 

Re: A new contributor

2018-10-05 Thread Gleb Kanterov
Hi all,

My name is Gleb and I work on Data Infrastructure at Spotify. We use Apache
Beam and develop spotify/scio.
From time to time I create JIRA issues and submit pull requests. Can I get
contributor access to JIRA (username: kanterov) and Slack?

Thanks,
Gleb

On Fri, Oct 5, 2018 at 2:15 AM Thomas Weise  wrote:

> Currently we add contributors for edit access to confluence wiki when it
> is actually required, on-demand. It can be requested on this list or on
> slack and a PMC member will take care of it.
>
> Thanks,
> Thomas
>
>
> On Thu, Oct 4, 2018 at 4:58 PM Mikhail Gryzykhin 
> wrote:
>
>> UPD: I've assembled this list here, so that this becomes a reference for
>> docs update.
>>
>> --Mikhail
>>
>> Have feedback ?
>>
>>
>> On Thu, Oct 4, 2018 at 4:43 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> I believe this list will also require:
>>>
>>> * Access to: https://cwiki.apache.org/confluence/display/BEAM/ \
>>> Sam, please create account, then ping here so that committer grants
>>> you access rights. (I believe this is correct order)
>>> * Refresh link to join Slack.
>>> I've invited Sam directly. Since we always need to refresh join
>>> slack link, I guess it is easier to change instruction to just ask
>>> committer to add new contributors to slack channel.
>>> * @Kenneth Knowles  Thank you for creating Jira account.
>>>
>>> --Mikhail
>>>
>>> Have feedback ?
>>>
>>>
>>> On Thu, Oct 4, 2018 at 12:06 PM Kenneth Knowles  wrote:
>>>
 I added you to the Contributor role, so you can be assigned JIRAs.
 (assuming your JIRA username is samrohde)

 Kenn

 On Thu, Oct 4, 2018 at 11:31 AM Sam Rohde  wrote:

> Hi all!
>
> My name is Sam and I work for Google Cloud Dataflow. I'm going to be
> starting work on Apache Beam soon and I wish to be added as a contributor
> in the Beam issue tracker for JIRA as well as any other necessary
> permissions to start work.
>
> Thanks,
> Sam
>


-- 
Cheers,
Gleb