Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.10.8 RC2

2020-02-29 Thread Ismaël Mejía
I'm happy to announce that we have unanimously approved this release.

There are 6 approving votes, 5 of which are binding:
* Luke Cwik
* Pablo Estrada
* Robert Bradshaw
* Ismaël Mejía
* Kenneth Knowles

There are no disapproving votes.

Thanks everyone!


On Fri, Feb 28, 2020 at 6:39 PM Kenneth Knowles  wrote:

> +1 (binding)
>
> On Fri, Feb 28, 2020 at 8:19 AM Ismaël Mejía  wrote:
>
>> +1 (binding)
>>
>> On Wed, Feb 26, 2020 at 10:28 PM Robert Bradshaw 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> On Wed, Feb 26, 2020 at 1:11 PM Pablo Estrada 
>>> wrote:
>>> >
>>> > +1 (binding)
>>> > Verified hashes.
>>> > Thank you Ismael!
>>> >
>>> > On Wed, Feb 26, 2020 at 11:30 AM Luke Cwik  wrote:
>>> >>
>>> >> +1 (binding)
>>> >>
>>> >> Verified signatures and contents of jar to not contain
>>> module-info.class
>>> >>
>>> >> On Wed, Feb 26, 2020 at 10:45 AM Kai Jiang 
>>> wrote:
>>> >>>
>>> >>> +1 (non-binding)
>>> >>>
>>> >>> On Wed, Feb 26, 2020 at 01:23 Ismaël Mejía 
>>> wrote:
>>> 
>>>  Please review the release of the following artifacts that we vendor:
>>>   * beam-vendor-bytebuddy-1_10_8
>>> 
>>>  Hi everyone,
>>>  Please review and vote on the release candidate #2 for the version
>>> 0.1, as follows:
>>>  [ ] +1, Approve the release
>>>  [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>> 
>>>  The complete staging area is available for your review, which
>>> includes:
>>>  * the official Apache source release to be deployed to
>>> dist.apache.org [1], which is signed with the key with fingerprint
>>> 3415631729E15B33051ADB670A9DAF6713B86349 [2],
>>>  * all artifacts to be deployed to the Maven Central Repository [3],
>>>  * commit hash "63492776f154464f67533a6059f162e6b8cf7315" [4],
>>> 
>>>  The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> 
>>>  Thanks,
>>>  Release Manager
>>> 
>>>  [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>>  [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>  [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1094/
>>>  [4]
>>> https://github.com/apache/beam/commit/63492776f154464f67533a6059f162e6b8cf7315
>>> 
>>>
>>


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-29 Thread Reza Rokni
Congratilation Kamil

On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:

> Welcome Kamil!
>
> On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:
>
>> Congrats, Kamil!
>>
>> On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía  wrote:
>>
>>> Congratulations Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:
>>>
 Congrats, Kamil!

 On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations, Kamil!
>
> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
> wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Kamil Wasilewski
>>
>> Kamil has contributed to Beam in many ways, including the performance
>> testing infrastructure, and a custom BQ source, along with other
>> contributions.
>>
>> In consideration of his contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer[1].
>>
>> Thanks for your contributions Kamil!
>>
>> Pablo, on behalf of the Apache Beam PMC.
>>
>> [1] https://beam.apache.org/contribute/become-a-committer
>> /#an-apache-beam-committer
>>
>>


Re: Issue with KafkaIO for list of topics

2020-02-29 Thread Maulik Soneji
Hello Rahul,

Thanks again for the detailed explanation.

I require some guidance on what values to be set for maxDelay and
previousWatermark for CustomTimestampPolicyWithLimitedDelay.

Currently, I was providing maxDelay as Duration.ZERO and previousWatermark
as Optional.empty().
With these values I see that the getWatermark function always goes to else
block(code link) and always returns TIMESTAMP_MIN_VALUE.
So with this case as well, I see that the watermark is returned as
TIMESTAMP_MIN_VALUE for zero throughput topics.

Please share your observations on how to tune the Timestamp Policy.

Thanks and regards,
Maulik


On Fri, Feb 28, 2020 at 8:46 PM rahul patwari 
wrote:

> Hi Maulik,
>
> Currently, I don't think it is possible to filter topics based on whether
> data is being produced to the topic (or) not.
> But, the Watermark logic can be changed to make the Pipeline work.
>
> Since the timestamps of the records are the time when the events are
> pushed to Kafka, every record will have monotonically increasing timestamps
> except for out of order events.
> Instead of assigning the Watermark as BoundedWindow.TIMESTAMP_MIN_VALUE
> by default, we can assign [current_timestamp - some_delay] as default and
> the same can be done in getWatermark() method, in which case, even if the
> partition is idle, Watermark will advance.
>
> Make sure that the timestamp of the Watermark is monotonically increasing
> and choose the delay carefully in order to avoid discarding out of order
> events.
>
> Refer
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/CustomTimestampPolicyWithLimitedDelay.java
> for an example.
>
> Regards,
> Rahul
>
>
> On Fri, Feb 28, 2020 at 6:54 PM Maulik Soneji 
> wrote:
>
>> Hi Rahul,
>>
>> Thank you very much for the detailed explanation.
>>
>> Since we don't know which are the topics that have zero throughputs, is
>> there a way in which we can filter out such topics in KafkaIO?
>>
>> Since KafkaIO doesn't support passing a regex to consume data from, I am
>> getting a list of topics from kafka and passing it.
>>
>> Is there a way to filter out such topics? Also, it can happen that when
>> the job has started the topic might have no data for a few windows and
>> after that, it can get some data. This filter should be dynamic as well.
>>
>> Please share some ideas on how we can make this work.
>>
>> Community members, please share your thoughts as well on how we can
>> achieve this.
>>
>> Thanks and regards,
>> Maulik
>>
>> On Fri, Feb 28, 2020 at 3:03 PM rahul patwari 
>> wrote:
>>
>>> Hi Maulik,
>>>
>>> This seems like an issue with Watermark.
>>> According to
>>> https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L240
>>> ,
>>>
>>> If there are multiple partitions (or) multiple topics, Watermark will be
>>> calculated for each of the partition and the minimum watermark is
>>> considered as the current Watermark.
>>> Assuming that no message is pushed to the topic with 0 throughput,
>>> according to your logic for the watermark calculation, the watermark of
>>> each partition for this topic will be BoundedWindow.TIMESTAMP_MIN_VALUE
>>> (the smallest representable timestamp of an element -
>>> https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/model/pipeline/src/main/proto/beam_runner_api.proto#L44
>>> ).
>>>
>>> As the result will be emitted from GroupByKey when the Watermark crosses
>>> the window and as the watermark is BoundedWindow.TIMESTAMP_MIN_VALUE,
>>> you are not seeing the results from GroupByKey.
>>>
>>> Regards,
>>> Rahul
>>>
>>> On Fri, Feb 28, 2020 at 12:39 PM Maulik Soneji 
>>> wrote:
>>>
 *Observations:*
 If we read using KafkaIO for a list of topics where one of the topics
 has zero throughputs,
 and KafkaIO is followed by GroupByKey stage, then:
 a. No data is output from GroupByKey stage for all the topics and not
 just the zero throughput topic.

 If all topics have some throughput coming in, then it works fine and we
 get some output from GroupByKey stage.

 Is this an issue?

 *Points:*
 a. The output from GroupByKey is only when all topics have some
 throughput
 b. This is a problem with KafkaIO + GroupByKey, for case where I have
 FileIO + GroupByKey, this issue doesn't arise. GroupByKey outputs some data
 even if there is no data for one of the files.
 c. Not a runner issue, since I ran it with FlinkRunner and
 DataflowRunner
 d. Even if lag is different for each topic on the list, we still get
 some output from GroupByKey.


 *Debugging:*While Debugging this issue I found that in split function
 of KafkaUnboundedSource we create KafkaUnboundedSource where partition list
 is one partition for each topic.

 I am not sure if this is some issue with 

Re: GSOC announced!

2020-02-29 Thread Can Gencer
Hi,

I'd like to also mention that Hazelcast is part of GSoC this year and one
of the proposal ideas we have put forward is to implement the portable
runner for Hazelcast Jet (there is already an "old-style" runner). It's a
good opportunity to understand the internals of a stream processing engine
as well as working with Apache Beam and you will get a lot of
support/mentoring from the Hazelcast Jet team. If anyone is interested, you
can contact me directly also.

See proposal 4 on this page:
https://summerofcode.withgoogle.com/organizations/6574602056105984/

On Mon, Jan 27, 2020 at 8:23 PM Rui Wang  wrote:

> Hi Xinbin,
>
> that sounds enough to get started. You can get permission on JIRA and then
> assign JIRA that you want to work on to yourself.
>
>
> -Rui
>
> On Fri, Jan 24, 2020 at 1:22 PM Xinbin Huang 
> wrote:
>
>> Hi Rui,
>>
>> Yes, I would like to contribute to Apache Beam, but I don't have a
>> specific topic of interest in mind.
>>
>> I have reviewed some of the issues on JIRA, and would like to work on
>> some of them. I have read through the contributing page
>> https://beam.apache.org/contribute/ and it gives me an idea about the
>> desired workflow. Besides that, are there any other sources I should refer
>> to?
>>
>> I will open a separate email to get permission on JIRA.
>>
>> Cheers.
>> Bin
>>
>> On Wed, Jan 15, 2020 at 1:16 PM Rui Wang  wrote:
>>
>>> Hi Xinbin,
>>>
>>> I assume you want to contribute to Apache Beam while you are less
>>> experienced, thus you want to seek for some mentorship?
>>>
>>> This topic was discussed before. I don't think we decided to build a
>>> formal mentorship program for Beam. Instead, would you share your interest
>>> first and then probably we could ask if there are people that know the
>>> topic who can actually mentor?
>>>
>>>
>>> -Rui
>>>
>>> On Wed, Jan 15, 2020 at 9:30 AM Xinbin Huang 
>>> wrote:
>>>
 Hi community,

 I am pretty new to the apache beam community and want to contribute to
 the project. I think GCOS is a great opportunity for people to learn and
 contribute, but I am not eligible for it because I am not a student. That
 being said, would that be opportunities for non-students to participate in
 this or other opportunities that is suitable for less experienced people
 that want to contribute?

 Thanks!
 Bin

 On Wed, Jan 15, 2020 at 8:52 AM Ismaël Mejía  wrote:

> Thanks for bringing this info. +1 on the Nexmark + Python +
> Portability project.
> Let's sync on that one Pablo. I am interested on co-mentoring it.
>
>
> On Tue, Jan 14, 2020 at 7:55 PM Rui Wang  wrote:
>
>> Great! I will try to propose something for BeamSQL.
>>
>>
>> -Rui
>>
>> On Tue, Jan 14, 2020 at 10:40 AM Pablo Estrada 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> As with every year, the Google Summer of Code has been announced[1],
>>> so we can start preparing for it if anyone is interested. It's early in 
>>> the
>>> process for now, but it's good to prepare early : )
>>>
>>> Here are the ASF mentor guidelines[2]. For now, the thing to do is
>>> to file JIRA issues for your projects, and apply the labels "mentor",
>>> "gsoc", "gsoc2020".
>>>
>>> When the time comes, the next steps are to join the
>>> ment...@community.apache.org list, and request the PMC for approval
>>> of a project.
>>>
>>> My current plan is to have these projects, though these are subject
>>> to change:
>>> - Build Nexmark pipelines for Python SDK (Ismael FYI)
>>> - Azure Blobstore File System for Java & Python
>>>
>>> I'll try to keep the dev@ list updated with other steps of the
>>> process.
>>> Thanks!
>>> -P.
>>>
>>> [1] https://summerofcode.withgoogle.com/
>>> [2]
>>> https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>>>
>>

-- 
This message contains confidential information and is intended only for the 
individuals named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. E-mail transmission cannot be 
guaranteed to be secure or error-free as information could be intercepted, 
corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. 
The sender therefore does not accept liability for any errors or omissions 
in the contents of this message, which arise as a result of e-mail 
transmission. If verification is required, please request a hard-copy 
version. -Hazelcast