Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-09-07 Thread Yifan Zou
Thanks all for comments and suggestions. We want to close this thread and
start implementing the new policy based on the discussion:

1. Stop assigning JIRAs to the first person listed in the dependency owners
file. Instead, CC people on the owners list.
2. We will be creating JIRAs for upgrading individual dependencies, not for
upgrading to specific versions of those dependencies. For example, if a
given dependency X is three minor versions or a year behind, we will create
a JIRA for upgrading it. The specific version to upgrade to has to be
determined by the Beam community. The community might choose to close a
JIRA if there are known issues with the available recent releases. The tool
will reopen such a closed JIRA to inform owners if Beam reaches the JIRA's
'fix version' or if three new versions of the dependency have been released
since the JIRA was closed.
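
For illustration, a minimal Java sketch of the reopen heuristic in (2); the
class, method names, and version scheme below are assumptions, not the
actual tool's code:

import java.util.Arrays;

/**
 * Hypothetical sketch of the reopen heuristic described above; not the
 * actual dependency tool. Versions are assumed to look like
 * "major.minor.patch".
 */
public class ReopenHeuristic {

  /** Parses "2.8.1" into {2, 8, 1} for numeric comparison. */
  private static int[] parse(String version) {
    return Arrays.stream(version.split("\\.")).mapToInt(Integer::parseInt).toArray();
  }

  /** Returns true if version a >= version b, comparing field by field. */
  private static boolean atLeast(String a, String b) {
    int[] x = parse(a);
    int[] y = parse(b);
    for (int i = 0; i < Math.min(x.length, y.length); i++) {
      if (x[i] != y[i]) {
        return x[i] > y[i];
      }
    }
    return x.length >= y.length;
  }

  /**
   * Reopen a closed JIRA once the Beam release under development reaches the
   * JIRA's 'fix version', or once three new versions of the dependency have
   * been released since the JIRA was closed.
   */
  public static boolean shouldReopen(
      String beamVersionInDevelopment, String jiraFixVersion, int depReleasesSinceClose) {
    if (jiraFixVersion != null && atLeast(beamVersionInDevelopment, jiraFixVersion)) {
      return true;
    }
    return depReleasesSinceClose >= 3;
  }

  public static void main(String[] args) {
    System.out.println(shouldReopen("3.0.0", "3.0.0", 0)); // true: fix version reached
    System.out.println(shouldReopen("2.8.0", "3.0.0", 2)); // false: not yet
  }
}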

Thank you.

Regards.
Yifan

On Wed, Sep 5, 2018 at 2:14 PM Yifan Zou  wrote:

> +1 on the jira "fix version".
> The release frequency of dependencies are various, so that using new
> information such as versions from the Jira closing date to reopen the
> issues might not be very proper. We could check the fix versions first, and
> if specified, then reopen the issue in that version's release cycle; it
> not, follow Cham's proposal (2).
>
> On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson 
>> wrote:
>>
>>> Thank you Cham, and everyone for contributing
>>>
>>> Sorry for the slow reply to a thread I started, but I've been swamped on
>>> non-Beam projects.
>>>
>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>>> been quite useful so far. How feasible is that for other connectors?
>>>
>>>
>>> I presume shimming might be needed in a few places but it's certainly
>>> something we might want to explore more. I'll look into KafkaIO.
>>>
>>> On Cham's proposal:
>>>
>>> (1) +0.5. We can always then opt to either assign or take ownership of
>>> an issue, although I am also happy to stick with the owners model - it
>>> prompted me to investigate and resulted in this thread.
>>>
>>> (2) I think this makes sense.
>>> A bot informing us that we're falling behind versions is immensely
>>> useful as long as we can link issues to others which might have a wider
>>> discussion (remember many dependencies need to be treated together, such
>>> as "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to
>>> let owners use the Jira "fix versions" field to set a future release that
>>> informs the bot when it should start alerting again?
>>>
>>
>> I think this makes sense. Setting a "fix version" will be specially
>> useful for dependency changes that result in API changes that have to be
>> postponed till next major version of Beam.
>>
>> On grouping, I believe we already group JIRAs into tasks and sub-tasks
>> based on group ids of dependencies. I suppose it will not be too hard to
>> close multiple sub-tasks with the same reasoning.
>>
>>
>>>
>>>
>>> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou  wrote:
>>>
 Thanks Cham for putting this together. Also, after modifying the
 dependency tool based on the policy above, we will close all existing JIRA
 issues, which prevents creating duplicate bugs and stops pushing assignees
 to upgrade dependencies via old bugs.

 Please let us know if you have any comments on the revised policy in
 Cham's email.

 Thanks all.

 Regards.
 Yifan Zou

 On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath 
 wrote:

> Based on this email thread and offline feedback from several folks,
> the current concerns regarding the dependency upgrade policy and tooling
> seem to be the following.
>
> (1) We have to be careful when upgrading dependencies. For example, we
> should not create JIRAs for upgrading to dependency versions that have
> known issues.
>
> (2) The dependency owners list can get stale. Somebody who is interested
> in upgrading a dependency today might not be interested in the same task
> in six months. The responsibility of upgrading a dependency should lie
> with the community instead of pre-identified owner(s).
>
> On the other hand, we do not want Beam to significantly fall behind
> when it comes to dependencies. We should upgrade dependencies whenever it
> makes sense. This allows us to offer a more up-to-date system and to make
> things easier for users that deploy Beam along with other systems.
>
> I discussed these issues with Yifan and we would like to suggest the
> following changes to the current policy and tooling that might help
> alleviate some of the concerns.
>
> (1) Instead of a dependency "owners" list we will be maintaining an
> "interested parties" list. When we create a JIRA for a dependency we will
> not assign it to an owner but rather we will CC 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-09-05 Thread Yifan Zou
+1 on the jira "fix version".
The release frequency of dependencies are various, so that using new
information such as versions from the Jira closing date to reopen the
issues might not be very proper. We could check the fix versions first, and
if specified, then reopen the issue in that version's release cycle; it
not, follow Cham's proposal (2).

On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath 
wrote:

>
>
> On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson 
> wrote:
>
>> Thank you Cham, and everyone for contributing
>>
>> Sorry for the slow reply to a thread I started, but I've been swamped on
>> non-Beam projects.
>>
>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>> been quite useful so far. How feasible is that for other connectors?
>>
>>
>> I presume shimming might be needed in a few places but it's certainly
>> something we might want to explore more. I'll look into KafkaIO.
>>
>> On Cham's proposal:
>>
>> (1) +0.5. We can always then opt to either assign or take ownership of an
>> issue, although I am also happy to stick with the owners model - it
>> prompted me to investigate and resulted in this thread.
>>
>> (2) I think this makes sense.
>> A bot informing us that we're falling behind versions is immensely useful
>> as long as we can link issues to others which might have a wider discussion
>> (remember many dependencies need to be treated together, such as "Support
>> Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners
>> use the Jira "fix versions" field to set a future release that informs the
>> bot when it should start alerting again?
>>
>
> I think this makes sense. Setting a "fix version" will be specially useful
> for dependency changes that result in API changes that have to be postponed
> till next major version of Beam.
>
> On grouping, I believe we already group JIRAs into tasks and sub-tasks
> based on group ids of dependencies. I suppose it will not be too hard to
> close multiple sub-tasks with the same reasoning.
>
>
>>
>>
>> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou  wrote:
>>
>>> Thanks Cham for putting this together. Also, after modifying the
>>> dependency tool based on the policy above, we will close all existing JIRA
>>> issues, which prevents creating duplicate bugs and stops pushing assignees
>>> to upgrade dependencies via old bugs.
>>>
>>> Please let us know if you have any comments on the revised policy in
>>> Cham's email.
>>>
>>> Thanks all.
>>>
>>> Regards.
>>> Yifan Zou
>>>
>>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath 
>>> wrote:
>>>
 Based on this email thread and offline feedback from several folks,
 the current concerns regarding the dependency upgrade policy and tooling
 seem to be the following.

 (1) We have to be careful when upgrading dependencies. For example, we
 should not create JIRAs for upgrading to dependency versions that have
 known issues.

 (2) The dependency owners list can get stale. Somebody who is interested in
 upgrading a dependency today might not be interested in the same task in
 six months. The responsibility of upgrading a dependency should lie with
 the community instead of pre-identified owner(s).

 On the other hand, we do not want Beam to significantly fall behind when
 it comes to dependencies. We should upgrade dependencies whenever it makes
 sense. This allows us to offer a more up-to-date system and to make things
 easier for users that deploy Beam along with other systems.

 I discussed these issues with Yifan and we would like to suggest the
 following changes to the current policy and tooling that might help
 alleviate some of the concerns.

 (1) Instead of a dependency "owners" list we will be maintaining an
 "interested parties" list. When we create a JIRA for a dependency we will
 not assign it to an owner but rather we will CC all the folks that
 mentioned that they will be interested in receiving updates related to that
 dependency. Hope is that some of the interested parties will also put
 forward the effort to upgrade dependencies they are interested in but the
 responsibility of upgrading dependencies lie with the community as a whole.

  (2) We will be creating JIRAs for upgrading individual dependencies,
 not for upgrading to specific versions of those dependencies. For example,
 if a given dependency X is three minor versions or a year behind, we will
 create a JIRA for upgrading it. The specific version to upgrade to has to
 be determined by the Beam community. The community might choose to close a
 JIRA if there are known issues with the available recent releases. The tool
 may reopen such a closed JIRA in the future if new information becomes
 available (for example, three new versions have been released since the
 JIRA was closed).

 Thoughts?

 Thanks,
 Cham

 On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <
 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-09-05 Thread Chamikara Jayalath
On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson 
wrote:

> Thank you Cham, and everyone for contributing
>
> Sorry for the slow reply to a thread I started, but I've been swamped on
> non-Beam projects.
>
>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>> been quite useful so far. How feasible is that for other connectors?
>
>
> I presume shimming might be needed in a few places but it's certainly
> something we might want to explore more. I'll look into KafkaIO.
>
> On Cham's proposal:
>
> (1) +0.5. We can always then opt to either assign or take ownership of an
> issue, although I am also happy to stick with the owners model - it
> prompted me to investigate and resulted in this thread.
>
> (2) I think this makes sense.
> A bot informing us that we're falling behind versions is immensely useful
> as long as we can link issues to others which might have a wider discussion
> (remember many dependencies need to be treated together, such as "Support
> Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners
> use the Jira "fix versions" field to set a future release that informs the
> bot when it should start alerting again?
>

I think this makes sense. Setting a "fix version" will be specially useful
for dependency changes that result in API changes that have to be postponed
till next major version of Beam.

On grouping, I believe we already group JIRAs into tasks and sub-tasks
based on group ids of dependencies. I suppose it will not be too hard to
close multiple sub-tasks with the same reasoning.


>
>
> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou  wrote:
>
>> Thanks Cham for putting this together. Also, after modifying the
>> dependency tool based on the policy above, we will close all existing JIRA
>> issues, which prevents creating duplicate bugs and stops pushing assignees
>> to upgrade dependencies via old bugs.
>>
>> Please let us know if you have any comments on the revised policy in
>> Cham's email.
>>
>> Thanks all.
>>
>> Regards.
>> Yifan Zou
>>
>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath 
>> wrote:
>>
>>> Based on this email thread and offline feedback from several folks,
>>> the current concerns regarding the dependency upgrade policy and tooling
>>> seem to be the following.
>>>
>>> (1) We have to be careful when upgrading dependencies. For example, we
>>> should not create JIRAs for upgrading to dependency versions that have
>>> known issues.
>>>
>>> (2) The dependency owners list can get stale. Somebody who is interested in
>>> upgrading a dependency today might not be interested in the same task in
>>> six months. The responsibility of upgrading a dependency should lie with
>>> the community instead of pre-identified owner(s).
>>>
>>> On the other hand, we do not want Beam to significantly fall behind when
>>> it comes to dependencies. We should upgrade dependencies whenever it makes
>>> sense. This allows us to offer a more up-to-date system and to make things
>>> easier for users that deploy Beam along with other systems.
>>>
>>> I discussed these issues with Yifan and we would like to suggest the
>>> following changes to the current policy and tooling that might help
>>> alleviate some of the concerns.
>>>
>>> (1) Instead of a dependency "owners" list we will be maintaining an
>>> "interested parties" list. When we create a JIRA for a dependency we will
>>> not assign it to an owner but rather we will CC all the folks that
>>> mentioned that they will be interested in receiving updates related to that
>>> dependency. Hope is that some of the interested parties will also put
>>> forward the effort to upgrade dependencies they are interested in but the
>>> responsibility of upgrading dependencies lie with the community as a whole.
>>>
>>>  (2) We will be creating JIRAs for upgrading individual dependencies,
>>> not for upgrading to specific versions of those dependencies. For example,
>>> if a given dependency X is three minor versions or a year behind, we will
>>> create a JIRA for upgrading it. The specific version to upgrade to has to
>>> be determined by the Beam community. The community might choose to close a
>>> JIRA if there are known issues with the available recent releases. The tool
>>> may reopen such a closed JIRA in the future if new information becomes
>>> available (for example, three new versions have been released since the
>>> JIRA was closed).
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath 
>>> wrote:
>>>


 On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise  wrote:

> I think there is an invalid assumption being made in this discussion,
> which is that most projects comply with semantic versioning. The reality in
> the open source big data space is unfortunately quite different. Ismaël has
> well characterized the situation and HBase isn't an exception. Another
> indicator of the scale of the problem is the extensive amount of shading
> used in Beam and other projects. It 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-09-05 Thread Tim Robertson
Thank you Cham, and everyone for contributing

Sorry for the slow reply to a thread I started, but I've been swamped on
non-Beam projects.

> KafkaIO's policy of 'let the user decide exact version at runtime' has been
> quite useful so far. How feasible is that for other connectors?


I presume shimming might be needed in a few places but it's certainly
something we might want to explore more. I'll look into KafkaIO.
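
A rough sketch of the kind of shim this might involve, with entirely
hypothetical type names; this is not taken from any existing Beam connector:

/** Hypothetical connector-facing abstraction over two client major versions. */
interface BackendShim {
  void write(String record);
}

class V1Shim implements BackendShim {
  @Override
  public void write(String record) {
    System.out.println("writing via 1.x client: " + record); // stand-in for 1.x API calls
  }
}

class V2Shim implements BackendShim {
  @Override
  public void write(String record) {
    System.out.println("writing via 2.x client: " + record); // stand-in for 2.x API calls
  }
}

public class ShimExample {
  /** Pick a shim based on a marker class that only the 2.x client ships. */
  static BackendShim createShim() {
    try {
      Class.forName("com.example.client.v2.OnlyInV2"); // hypothetical marker class
      return new V2Shim();
    } catch (ClassNotFoundException e) {
      return new V1Shim();
    }
  }

  public static void main(String[] args) {
    // Selects V1Shim here, since the hypothetical marker class is absent.
    createShim().write("record-1");
  }
}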

On Cham's proposal:

(1) +0.5. We can always then opt to either assign or take ownership of an
issue, although I am also happy to stick with the owners model - it
prompted me to investigate and resulted in this thread.

(2) I think this makes sense.
A bot informing us that we're falling behind versions is immensely useful
as long as we can link issues to others which might have a wider discussion
(remember many dependencies need to be treated together, such as "Support
Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners
use the Jira "fix versions" field to set a future release that informs the
bot when it should start alerting again?



On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou  wrote:

> Thanks Cham for putting this together. Also, after modifying the
> dependency tool based on the policy above, we will close all existing JIRA
> issues, which prevents creating duplicate bugs and stops pushing assignees
> to upgrade dependencies via old bugs.
>
> Please let us know if you have any comments on the revised policy in
> Cham's email.
>
> Thanks all.
>
> Regards.
> Yifan Zou
>
> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath 
> wrote:
>
>> Based on this email thread and offline feedback from several folks,
>> the current concerns regarding the dependency upgrade policy and tooling
>> seem to be the following.
>>
>> (1) We have to be careful when upgrading dependencies. For example, we
>> should not create JIRAs for upgrading to dependency versions that have
>> known issues.
>>
>> (2) The dependency owners list can get stale. Somebody who is interested in
>> upgrading a dependency today might not be interested in the same task in
>> six months. The responsibility of upgrading a dependency should lie with
>> the community instead of pre-identified owner(s).
>>
>> On the other hand, we do not want Beam to significantly fall behind when
>> it comes to dependencies. We should upgrade dependencies whenever it makes
>> sense. This allows us to offer a more up-to-date system and to make things
>> easier for users that deploy Beam along with other systems.
>>
>> I discussed these issues with Yifan and we would like to suggest the
>> following changes to the current policy and tooling that might help
>> alleviate some of the concerns.
>>
>> (1) Instead of a dependency "owners" list we will be maintaining an
>> "interested parties" list. When we create a JIRA for a dependency we will
>> not assign it to an owner but rather we will CC all the folks that
>> mentioned that they will be interested in receiving updates related to that
>> dependency. Hope is that some of the interested parties will also put
>> forward the effort to upgrade dependencies they are interested in but the
>> responsibility of upgrading dependencies lie with the community as a whole.
>>
>>  (2) We will be creating JIRAs for upgrading individual dependencies, not
>> for upgrading to specific versions of those dependencies. For example, if a
>> given dependency X is three minor versions or a year behind, we will create
>> a JIRA for upgrading it. The specific version to upgrade to has to be
>> determined by the Beam community. The community might choose to close a
>> JIRA if there are known issues with the available recent releases. The tool
>> may reopen such a closed JIRA in the future if new information becomes
>> available (for example, three new versions have been released since the
>> JIRA was closed).
>>
>> Thoughts?
>>
>> Thanks,
>> Cham
>>
>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise  wrote:
>>>
 I think there is an invalid assumption being made in this discussion,
 which is that most projects comply with semantic versioning. The reality in
 the open source big data space is unfortunately quite different. Ismaël has
 well characterized the situation and HBase isn't an exception. Another
 indicator of the scale of the problem is the extensive amount of shading
 used in Beam and other projects. It wouldn't be necessary if semver
 compliance were something we could rely on.

 Our recent Flink upgrade broke user(s). And we noticed a backward
 incompatible Flink change that affected the portable Flink runner even
 between patches.

 Many projects (including Beam) guarantee compatibility only for a
 subset of public API. Sometimes a REST API is not covered, sometimes
 protocols that are not strictly internal change, and so on, all of which
 can break users, despite the public API remaining "compatible". As much as
 I would love to rely on 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-09-04 Thread Yifan Zou
Thanks Cham for putting this together. Also, after modifying the dependency
tool based on the policy above, we will close all existing JIRA issues that
prevent creating duplicate bugs and stop pushing assignees to upgrade
dependencies with old bugs.

Please let us know if you have any comments on the revised policy in Cham's
email.

Thanks all.

Regards.
Yifan Zou

On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath 
wrote:

> Based on this email thread and offline feedback from several folks,
> the current concerns regarding the dependency upgrade policy and tooling
> seem to be the following.
>
> (1) We have to be careful when upgrading dependencies. For example, we
> should not create JIRAs for upgrading to dependency versions that have
> known issues.
>
> (2) The dependency owners list can get stale. Somebody who is interested in
> upgrading a dependency today might not be interested in the same task in
> six months. The responsibility of upgrading a dependency should lie with
> the community instead of pre-identified owner(s).
>
> On the other hand, we do not want Beam to significantly fall behind when it
> comes to dependencies. We should upgrade dependencies whenever it makes
> sense. This allows us to offer a more up-to-date system and to make things
> easier for users that deploy Beam along with other systems.
>
> I discussed these issues with Yifan and we would like to suggest the
> following changes to the current policy and tooling that might help
> alleviate some of the concerns.
>
> (1) Instead of a dependency "owners" list we will be maintaining an
> "interested parties" list. When we create a JIRA for a dependency we will
> not assign it to an owner but rather we will CC all the folks that
> mentioned that they will be interested in receiving updates related to that
> dependency. Hope is that some of the interested parties will also put
> forward the effort to upgrade dependencies they are interested in but the
> responsibility of upgrading dependencies lie with the community as a whole.
>
>  (2) We will be creating JIRAs for upgrading individual dependencies, not
> for upgrading to specific versions of those dependencies. For example, if a
> given dependency X is three minor versions or a year behind, we will create
> a JIRA for upgrading it. The specific version to upgrade to has to be
> determined by the Beam community. The community might choose to close a
> JIRA if there are known issues with the available recent releases. The tool
> may reopen such a closed JIRA in the future if new information becomes
> available (for example, three new versions have been released since the
> JIRA was closed).
>
> Thoughts?
>
> Thanks,
> Cham
>
> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise  wrote:
>>
>>> I think there is an invalid assumption being made in this discussion,
>>> which is that most projects comply with semantic versioning. The reality in
>>> the open source big data space is unfortunately quite different. Ismaël has
>>> well characterized the situation and HBase isn't an exception. Another
>>> indicator of the scale of the problem is the extensive amount of shading
>>> used in Beam and other projects. It wouldn't be necessary if semver
>>> compliance were something we could rely on.
>>>
>>> Our recent Flink upgrade broke user(s). And we noticed a backward
>>> incompatible Flink change that affected the portable Flink runner even
>>> between patches.
>>>
>>> Many projects (including Beam) guarantee compatibility only for a subset
>>> of public API. Sometimes a REST API is not covered, sometimes protocols
>>> that are not strictly internal change, and so on, all of which can break
>>> users, despite the public API remaining "compatible". As much as I would
>>> love to rely on the version number to tell me whether an upgrade is safe
>>> or not, that's not practically possible.
>>>
>>> Furthermore, we need to proceed with caution when forcing upgrades on
>>> users that host the target systems. To stay with the Flink example, moving
>>> Beam from Flink 1.4 to 1.5 is actually a major change to some, because they
>>> now have to upgrade their Flink clusters/deployments to be able to use the
>>> new version of Beam.
>>>
>>> Upgrades need to be done with caution and may require extensive
>>> verification beyond what our automation provides. I think the Spark change
>>> from 1.x to 2.x and also the JDK 1.8 change were good examples; they
>>> provided the community a window to provide feedback and influence the
>>> change.
>>>
>>
>> Thanks for the clarification.
>>
>> The current policy indeed requests caution and explicit checks when
>> upgrading all dependencies (including minor and patch versions), but the
>> language might have to be updated to emphasize your concerns.
>>
>> Here's the current text.
>>
>> "Beam releases adhere to  
>> semantic
>> versioning. Hence, community members should take care when updating
>> dependencies. Minor version updates 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Chamikara Jayalath
On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise  wrote:

> I think there is an invalid assumption being made in this discussion,
> which is that most projects comply with semantic versioning. The reality in
> the open source big data space is unfortunately quite different. Ismaël has
> well characterized the situation and HBase isn't an exception. Another
> indicator of the scale of the problem is the extensive amount of shading
> used in Beam and other projects. It wouldn't be necessary if semver
> compliance were something we could rely on.
>
> Our recent Flink upgrade broke user(s). And we noticed a backward
> incompatible Flink change that affected the portable Flink runner even
> between patches.
>
> Many projects (including Beam) guarantee compatibility only for a subset
> of public API. Sometimes a REST API is not covered, sometimes protocols
> that are not strictly internal change, and so on, all of which can break
> users, despite the public API remaining "compatible". As much as I would
> love to rely on the version number to tell me whether an upgrade is safe
> or not, that's not practically possible.
>
> Furthermore, we need to proceed with caution when forcing upgrades on
> users that host the target systems. To stay with the Flink example, moving
> Beam from Flink 1.4 to 1.5 is actually a major change to some, because they
> now have to upgrade their Flink clusters/deployments to be able to use the
> new version of Beam.
>
> Upgrades need to be done with caution and may require extensive
> verification beyond what our automation provides. I think the Spark change
> from 1.x to 2.x and also the JDK 1.8 change were good examples; they
> provided the community a window to provide feedback and influence the
> change.
>

Thanks for the clarification.

The current policy indeed requests caution and explicit checks when upgrading
all dependencies (including minor and patch versions), but the language might
have to be updated to emphasize your concerns.

Here's the current text.

"Beam releases adhere to
 semantic
versioning. Hence, community members should take care when updating
dependencies. Minor version updates to dependencies should be backwards
compatible in most cases. Some updates to dependencies though may result in
backwards incompatible API or functionality changes to Beam. PR reviewers
and committers should take care to detect any dependency updates that could
potentially introduce backwards incompatible changes to Beam before merging,
and PRs that update dependencies should include a statement regarding this
verification in the form of a PR comment. Dependency updates that result in
backwards incompatible changes to non-experimental features of Beam should
be held till next major version release of Beam. Any exceptions to this
policy should only occur in extreme cases (for example, due to a security
vulnerability of an existing dependency that is only fixed in a subsequent
major version) and should be discussed in the Beam dev list. Note that
backwards incompatible changes to experimental features may be introduced
in a minor version release."

Also, are there any other steps we can take to make sure that Beam
dependencies are not too old while offering a stable system? Note that
having a lot of legacy dependencies that do not get upgraded regularly can
also result in user pain and Beam being unusable for certain users who run
into dependency conflicts when using Beam along with other systems (which
will increase the amount of shading/vendoring we have to do).

Please note that the current tooling does not force upgrades or automatically
upgrade dependencies. It simply creates JIRAs that can be closed with a
reason if needed. For the Python SDK, though, we have version ranges in place
for most dependencies [1], so these dependencies get updated automatically
according to the corresponding ranges.

[1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103

Thanks,
Cham


>
> Thanks,
> Thomas
>
>
>
> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi  wrote:
>
>> Thanks for the IO versioning summary.
>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>> been quite useful so far. How feasible is that for other connectors?
>>
>> Also, KafkaIO does not limit itself to the minimum features available across
>> all the supported versions. Some of the features (e.g. server-side
>> timestamps) are disabled based on the runtime Kafka version. The unit tests
>> currently run with a single recent version. Integration tests could certainly
>> use multiple versions. With some more effort in writing tests, we could run
>> the unit tests against multiple versions as well.
>>
>> Raghu.
>>
>> IO versioning
>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>> more active users needing it (more deployments). We support 2.x and
>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>> because most big 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Thomas Weise
I think there is an invalid assumption being made in this discussion, which
is that most projects comply with semantic versioning. The reality in the
open source big data space is unfortunately quite different. Ismaël has
well characterized the situation and HBase isn't an exception. Another
indicator of the scale of the problem is the extensive amount of shading used
in Beam and other projects. It wouldn't be necessary if semver compliance
were something we could rely on.

Our recent Flink upgrade broke user(s). And we noticed a backward
incompatible Flink change that affected the portable Flink runner even
between patches.

Many projects (including Beam) guarantee compatibility only for a subset of
public API. Sometimes a REST API is not covered, sometimes protocols that are
not strictly internal change, and so on, all of which can break users, despite
the public API remaining "compatible". As much as I would love to rely on
the version number to tell me whether an upgrade is safe or not, that's not
practically possible.

Furthermore, we need to proceed with caution when forcing upgrades on users
that host the target systems. To stay with the Flink example, moving Beam from
Flink 1.4 to 1.5 is actually a major change to some, because they now have to
upgrade their Flink clusters/deployments to be able to use the new version
of Beam.

Upgrades need to be done with caution and may require extensive
verification beyond what our automation provides. I think the Spark change
from 1.x to 2.x and also the JDK 1.8 change were good examples; they
provided the community a window to provide feedback and influence the
change.

Thanks,
Thomas



On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi  wrote:

> Thanks for the IO versioning summary.
> KafkaIO's policy of 'let the user decide exact version at runtime' has
> been quite useful so far. How feasible is that for other connectors?
>
> Also, KafkaIO does not limit itself to the minimum features available across
> all the supported versions. Some of the features (e.g. server-side
> timestamps) are disabled based on the runtime Kafka version. The unit tests
> currently run with a single recent version. Integration tests could certainly
> use multiple versions. With some more effort in writing tests, we could run
> the unit tests against multiple versions as well.
>
> Raghu.
>
> IO versioning
>> * Elasticsearch. We delayed the move to version 6 until we heard of
>> more active users needing it (more deployments). We support 2.x and
>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>> because most big data distributions still use 5.x (however 5.x has
>> been EOL).
>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>> most of the deployments of Kafka use earlier versions than 1.x. This
>> module uses a single version with the kafka client as a provided
>> dependency and so far it works (but we don’t have multi version
>> tests).
>>
>
>
> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía  wrote:
>
>> I think we should refine the strategy on dependencies discussed
>> recently. Sorry to come to this late (I did not follow closely the
>> previous discussion), but the current approach is clearly not in line
>> with the industry reality (at least not for IO connectors + Hadoop +
>> Spark/Flink use).
>>
>> A really proactive approach to dependency updates is a good practice
>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
>> Bigquery, AWS S3, etc. However when we talk about self hosted data
>> sources or processing systems this gets more complicated and I think
>> we should be more flexible and do this case by case (and remove these
>> from the auto update email reminder).
>>
>> Some open source projects have at least three maintained versions:
>> - LTS – maps to what most of the people have installed (or the big
>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>
>> Following the most recent versions can be good for staying close to the
>> current development of other projects and some of the fixes, but these
>> versions are commonly not deployed for most users, and adopting an LTS
>> or stable-only approach won't satisfy all cases either. To understand
>> why this is complex, let's see some historical issues:
>>
>> IO versioning
>> * Elasticsearch. We delayed the move to version 6 until we heard of
>> more active users needing it (more deployments). We support 2.x and
>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>> because most big data distributions still use 5.x (however 5.x has
>> been EOL).
>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>> most of the deployments of Kafka use earlier versions than 1.x. 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Raghu Angadi
Thanks for the IO versioning summary.
KafkaIO's policy of 'let the user decide exact version at runtime' has been
quite useful so far. How feasible is that for other connectors?

Also, KafkaIO does not limit itself to the minimum features available across
all the supported versions. Some of the features (e.g. server-side
timestamps) are disabled based on the runtime Kafka version. The unit tests
currently run with a single recent version. Integration tests could certainly
use multiple versions. With some more effort in writing tests, we could run
the unit tests against multiple versions as well.
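
A minimal sketch of that kind of runtime feature probe; this is a hedged
illustration, not KafkaIO's actual implementation (it assumes only that the
probed class may or may not be on the classpath):

import java.lang.reflect.Method;

public class FeatureProbe {

  /** Returns true if the named class is on the classpath and has the zero-arg method. */
  static boolean hasMethod(String className, String methodName) {
    try {
      Method m = Class.forName(className).getMethod(methodName);
      return m != null;
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Probe for record timestamps, which only newer Kafka clients expose.
    boolean timestamps =
        hasMethod("org.apache.kafka.clients.consumer.ConsumerRecord", "timestamp");
    System.out.println("server-side timestamps available: " + timestamps);
  }
}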

Raghu.

IO versioning
> * Elasticsearch. We delayed the move to version 6 until we heard of
> more active users needing it (more deployments). We support 2.x and
> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
> because most big data distributions still use 5.x (however 5.x has
> been EOL).
> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
> most of the deployments of Kafka use earlier versions than 1.x. This
> module uses a single version with the kafka client as a provided
> dependency and so far it works (but we don’t have multi version
> tests).
>


On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía  wrote:

> I think we should refine the strategy on dependencies discussed
> recently. Sorry to come to this late (I did not follow closely the
> previous discussion), but the current approach is clearly not in line
> with the industry reality (at least not for IO connectors + Hadoop +
> Spark/Flink use).
>
> A really proactive approach to dependency updates is a good practice
> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
> Bigquery, AWS S3, etc. However when we talk about self hosted data
> sources or processing systems this gets more complicated and I think
> we should be more flexible and do this case by case (and remove these
> from the auto update email reminder).
>
> Some open source projects have at least three maintained versions:
> - LTS – maps to what most of the people have installed (or the big
> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>
> Following the most recent versions can be good for staying close to the
> current development of other projects and some of the fixes, but these
> versions are commonly not deployed for most users, and adopting an LTS
> or stable-only approach won't satisfy all cases either. To understand
> why this is complex, let's see some historical issues:
>
> IO versioning
> * Elasticsearch. We delayed the move to version 6 until we heard of
> more active users needing it (more deployments). We support 2.x and
> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
> because most big data distributions still use 5.x (however 5.x has
> been EOL).
> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
> most of the deployments of Kafka use earlier versions than 1.x. This
> module uses a single version with the kafka client as a provided
> dependency and so far it works (but we don’t have multi version
> tests).
>
> Runners versioning
> * The move from Spark 1 to Spark 2 was decided after evaluating the
> tradeoff between the burden of maintaining multiple version support and
> the issues of introducing breaking changes. This is a rare case but also
> one with consequences. This dependency is provided, but we don't actively
> test issues on version migration.
> * Flink moved to version 1.5, introducing an incompatibility in
> checkpointing (discussed recently, with no consensus yet on how to
> handle it).
>
> As you can see, it seems really hard to have a solution that fits all
> cases. Probably the only rule that I see from this list is that we
> should upgrade versions for connectors that have been deprecated or
> have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>
> For the case of the provided dependencies, I wonder if, as part of the
> tests, we should test with multiple versions (note that this
> is currently blocked by BEAM-4087).
>
> Any other ideas or opinions on how we can handle this? What do other
> people in the community think? (Notice that this can be related to the
> ongoing LTS discussion.)
>
>
> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>  wrote:
> >
> > Hi folks,
> >
> > I'd like to revisit the discussion around our versioning policy
> specifically for the Hadoop ecosystem and make sure we are aware of the
> implications.
> >
> > As an example our policy today would have us on HBase 2.1 and I have
> reminders to address this.
> >
> > However, currently the versions of HBase in the major hadoop distros are:
> >
> >  - Cloudera 5 on HBase 1.2 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Chamikara Jayalath
Constraints on existing dependencies are a valid concern, and we do not have
a good solution for this currently. One way to handle this is simply to
close automatically created JIRAs with a comment; the tool will not try
to create further JIRAs for the same dependency after that. But we should
be able to come up with a better solution. In addition to this, integration
tests should also help to make sure that we do not accidentally break Beam
components when upgrading these dependencies.

Thanks,
Cham

On Tue, Aug 28, 2018 at 10:45 AM Andrew Pilloud  wrote:

> The Beam SQL module faces similar problems; several of our dependencies
> are constrained by maintaining compatibility with versions used by Calcite.
> We've written tests to detect some of these incompatibilities. Could we add
> integration tests for these major Hadoop distros that ensure we maintain
> compatibility, rather than explicitly calling them out in our upgrade policy?
>
> Andrew
>
> On Tue, Aug 28, 2018 at 10:31 AM Chamikara Jayalath 
> wrote:
>
>> Thanks Tim for raising this, and thanks JB and Ismaël for all the great
>> points.
>>
>> I agree that a one-size-fits-all solution will not work when it comes to
>> dependencies. Based on past examples, clearly there are many cases where we
>> should proceed with caution and upgrade dependencies with care.
>>
>> That said, given that Beam respects semantic versioning and most of our
>> dependencies respect semantic versioning, I think we should be able to
>> upgrade most minor (and patch) versions of dependencies with relative ease.
>> The current policy is to automatically create JIRAs if we are more than
>> three minor versions behind. I'm not sure if HBase respects semantic
>> versioning. If it does not, I think it should be the exception, not the norm.
>>
>> When it comes to major version upgrades, though, we'll have to proceed with
>> caution. In addition to all the case-by-case reasoning Ismaël gave above,
>> there's also the real possibility of a major version upgrade changing the
>> Beam API (syntax or semantics) in a non backwards compatible way and
>> breaking the backwards compatibility guarantee offered by Beam. The current
>> dependency policy [1] tries to capture this in a separate section and
>> requires all PRs that upgrade dependencies to contain a statement regarding
>> backwards compatibility.
>>
>> I agree that there might be many modifications we have to make to
>> existing policies when it comes to upgrading Beam dependencies in
>> accordance with industry standards. The current policies are there as a
>> first version for us to try out. We should definitely reevaluate and update
>> the policies from time to time as needed. I'm also extremely eager to hear
>> what others in the community think about this.
>>
>> Thanks,
>> Cham
>>
>> [1] https://beam.apache.org/contribute/dependencies/
>>
>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía  wrote:
>>
>>> I think we should refine the strategy on dependencies discussed
>>> recently. Sorry to come to this late (I did not follow closely the
>>> previous discussion), but the current approach is clearly not in line
>>> with the industry reality (at least not for IO connectors + Hadoop +
>>> Spark/Flink use).
>>>
>>> A really proactive approach to dependency updates is a good practice
>>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>>> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
>>> Bigquery, AWS S3, etc. However when we talk about self hosted data
>>> sources or processing systems this gets more complicated and I think
>>> we should be more flexible and do this case by case (and remove these
>>> from the auto update email reminder).
>>>
>>> Some open source projects have at least three maintained versions:
>>> - LTS – maps to what most of the people have installed (or the big
>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>
>>> Following the most recent versions can be good for staying close to the
>>> current development of other projects and some of the fixes, but these
>>> versions are commonly not deployed for most users, and adopting an LTS
>>> or stable-only approach won't satisfy all cases either. To understand
>>> why this is complex, let's see some historical issues:
>>>
>>> IO versioning
>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>> more active users needing it (more deployments). We support 2.x and
>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>> because most big data distributions still use 5.x (however 5.x has
>>> been EOL).
>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>> most of the deployments of Kafka use earlier versions than 1.x. This
>>> module uses a single version with the kafka client as a provided
>>> dependency and so far it 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Andrew Pilloud
The Beam SQL module faces similar problems; several of our dependencies are
constrained by maintaining compatibility with versions used by Calcite.
We've written tests to detect some of these incompatibilities. Could we add
integration tests for these major Hadoop distros that ensure we maintain
compatibility, rather than explicitly calling them out in our upgrade policy?
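
Such a test could look roughly like the JUnit sketch below, which fails if a
pinned transitive dependency stops exposing an API we rely on; the probed
class and method are illustrative only, not Beam SQL's actual checks:

import static org.junit.Assert.assertNotNull;

import org.junit.Test;

public class DependencyCompatibilityTest {

  /**
   * Fails if the Guava on the classpath (whatever version Calcite forces)
   * no longer exposes an API we rely on. Illustrative only.
   */
  @Test
  public void guavaExposesImmutableListCopyOf() throws Exception {
    Class<?> clazz = Class.forName("com.google.common.collect.ImmutableList");
    assertNotNull(clazz.getMethod("copyOf", Iterable.class));
  }
}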

Andrew

On Tue, Aug 28, 2018 at 10:31 AM Chamikara Jayalath 
wrote:

> Thanks Tim for raising this, and thanks JB and Ismaël for all the great
> points.
>
> I agree that a one-size-fits-all solution will not work when it comes to
> dependencies. Based on past examples, clearly there are many cases where we
> should proceed with caution and upgrade dependencies with care.
>
> That said, given that Beam respects semantic versioning and most of our
> dependencies respect semantic versioning, I think we should be able to
> upgrade most minor (and patch) versions of dependencies with relative ease.
> The current policy is to automatically create JIRAs if we are more than
> three minor versions behind. I'm not sure if HBase respects semantic
> versioning. If it does not, I think it should be the exception, not the norm.
>
> When it comes to major version upgrades, though, we'll have to proceed with
> caution. In addition to all the case-by-case reasoning Ismaël gave above,
> there's also the real possibility of a major version upgrade changing the
> Beam API (syntax or semantics) in a non backwards compatible way and
> breaking the backwards compatibility guarantee offered by Beam. The current
> dependency policy [1] tries to capture this in a separate section and
> requires all PRs that upgrade dependencies to contain a statement regarding
> backwards compatibility.
>
> I agree that there might be many modifications we have to make to existing
> policies when it comes to upgrading Beam dependencies in accordance with
> industry standards. The current policies are there as a first version for us
> to try out. We should definitely reevaluate and update the policies from
> time to time as needed. I'm also extremely eager to hear what others in the
> community think about this.
>
> Thanks,
> Cham
>
> [1] https://beam.apache.org/contribute/dependencies/
>
> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía  wrote:
>
>> I think we should refine the strategy on dependencies discussed
>> recently. Sorry to come to this late (I did not follow closely the
>> previous discussion), but the current approach is clearly not in line
>> with the industry reality (at least not for IO connectors + Hadoop +
>> Spark/Flink use).
>>
>> A really proactive approach to dependency updates is a good practice
>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
>> Bigquery, AWS S3, etc. However when we talk about self hosted data
>> sources or processing systems this gets more complicated and I think
>> we should be more flexible and do this case by case (and remove these
>> from the auto update email reminder).
>>
>> Some open source projects have at least three maintained versions:
>> - LTS – maps to what most of the people have installed (or the big
>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>
>> Following the most recent versions can be good for staying close to the
>> current development of other projects and some of the fixes, but these
>> versions are commonly not deployed for most users, and adopting an LTS
>> or stable-only approach won't satisfy all cases either. To understand
>> why this is complex, let's see some historical issues:
>>
>> IO versioning
>> * Elasticsearch. We delayed the move to version 6 until we heard of
>> more active users needing it (more deployments). We support 2.x and
>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>> because most big data distributions still use 5.x (however 5.x has
>> been EOL).
>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>> most of the deployments of Kafka use earlier versions than 1.x. This
>> module uses a single version with the kafka client as a provided
>> dependency and so far it works (but we don’t have multi version
>> tests).
>>
>> Runners versioning
>> * The move from Spark 1 to Spark 2 was decided after evaluating the
>> tradeoff between the burden of maintaining multiple version support and
>> the issues of introducing breaking changes. This is a rare case but also
>> one with consequences. This dependency is provided, but we don't actively
>> test issues on version migration.
>> * Flink moved to version 1.5, introducing an incompatibility in
>> checkpointing (discussed recently, with no consensus yet on how to
>> handle it).
>>
>> As you can see, it seems really hard to have a solution that fits all
>> cases. 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Chamikara Jayalath
Thanks Tim for raising this, and thanks JB and Ismaël for all the great
points.

I agree that a one-size-fits-all solution will not work when it comes to
dependencies. Based on past examples, clearly there are many cases where we
should proceed with caution and upgrade dependencies with care.

That said, given that Beam respects semantic versioning and most of our
dependencies respect semantic versioning, I think we should be able to
upgrade most minor (and patch) versions of dependencies with relative ease.
The current policy is to automatically create JIRAs if we are more than three
minor versions behind. I'm not sure if HBase respects semantic versioning.
If it does not, I think it should be the exception, not the norm.
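
As a concrete illustration of the "three minor versions behind" rule, a
staleness check could look like the hypothetical sketch below, assuming
plain "major.minor.patch" version strings; this is not the actual tool:

public class StalenessCheck {

  /**
   * Returns true if 'current' and 'latest' share a major version and the
   * minor-version gap exceeds the allowed threshold. Major upgrades are
   * deliberately excluded; per the discussion they are handled case by case.
   */
  static boolean isStale(String current, String latest, int maxMinorGap) {
    String[] cur = current.split("\\.");
    String[] lat = latest.split("\\.");
    if (!cur[0].equals(lat[0])) {
      return false;
    }
    return Integer.parseInt(lat[1]) - Integer.parseInt(cur[1]) > maxMinorGap;
  }

  public static void main(String[] args) {
    System.out.println(isStale("2.4.0", "2.8.1", 3)); // true: four minor versions behind
    System.out.println(isStale("2.6.0", "2.8.1", 3)); // false: two minor versions behind
  }
}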

When it comes to major version upgrades, though, we'll have to proceed with
caution. In addition to all the case-by-case reasoning Ismaël gave above,
there's also the real possibility of a major version upgrade changing the
Beam API (syntax or semantics) in a non backwards compatible way and breaking
the backwards compatibility guarantee offered by Beam. The current dependency
policy [1] tries to capture this in a separate section and requires all PRs
that upgrade dependencies to contain a statement regarding backwards
compatibility.

I agree that there might be many modifications we have to make to existing
policies when it comes to upgrading Beam dependencies in accordance with
industry standards. The current policies are there as a first version for us
to try out. We should definitely reevaluate and update the policies from time
to time as needed. I'm also extremely eager to hear what others in the
community think about this.

Thanks,
Cham

[1] https://beam.apache.org/contribute/dependencies/

On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía  wrote:

> I think we should refine the strategy on dependencies discussed
> recently. Sorry to come to this late (I did not follow closely the
> previous discussion), but the current approach is clearly not in line
> with the industry reality (at least not for IO connectors + Hadoop +
> Spark/Flink use).
>
> A really proactive approach to dependency updates is a good practice
> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
> Bigquery, AWS S3, etc. However when we talk about self hosted data
> sources or processing systems this gets more complicated and I think
> we should be more flexible and do this case by case (and remove these
> from the auto update email reminder).
>
> Some open source projects have at least three maintained versions:
> - LTS – maps to what most of the people have installed (or the big
> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>
> Following the most recent versions can be good for staying close to the
> current development of other projects and some of the fixes, but these
> versions are commonly not deployed for most users, and adopting an LTS
> or stable-only approach won't satisfy all cases either. To understand
> why this is complex, let's see some historical issues:
>
> IO versioning
> * Elasticsearch. We delayed the move to version 6 until we heard of
> more active users needing it (more deployments). We support 2.x and
> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
> because most big data distributions still use 5.x (however 5.x has
> been EOL).
> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
> most of the deployments of Kafka use earlier versions than 1.x. This
> module uses a single version with the kafka client as a provided
> dependency and so far it works (but we don’t have multi version
> tests).
>
> Runners versioning
> * The move from Spark 1 to Spark 2 was decided after evaluating the
> tradeoff between the burden of maintaining multiple version support and
> the issues of introducing breaking changes. This is a rare case but also
> one with consequences. This dependency is provided, but we don't actively
> test issues on version migration.
> * Flink moved to version 1.5, introducing an incompatibility in
> checkpointing (discussed recently, with no consensus yet on how to
> handle it).
>
> As you can see, it seems really hard to have a solution that fits all
> cases. Probably the only rule that I see from this list is that we
> should upgrade versions for connectors that have been deprecated or
> have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>
> For the case of the provided dependencies, I wonder if, as part of the
> tests, we should test with multiple versions (note that this
> is currently blocked by BEAM-4087).
>
> Any other ideas or opinions on how we can handle this? What do other
> people in the community think? (Notice that this can be related to the
> ongoing LTS discussion.)
>
>
> On 

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Ismaël Mejía
I think we should refine the strategy on dependencies discussed
recently. Sorry to come to this late (I did not follow closely the
previous discussion), but the current approach is clearly not in line
with the industry reality (at least not for IO connectors + Hadoop +
Spark/Flink use).

A really proactive approach to dependency updates is a good practice
for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
Bigquery, AWS S3, etc. However when we talk about self hosted data
sources or processing systems this gets more complicated and I think
we should be more flexible and do this case by case (and remove these
from the auto update email reminder).

Some open source projects have at least three maintained versions:
- LTS – maps to what most of the people have installed (or the big
data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
- Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
- Next – latest release. HBase 2.1.x Hadoop 3.1.x

Following the most recent versions can be good for staying close to the
current development of other projects and some of the fixes, but these
versions are commonly not deployed for most users, and adopting an LTS
or stable-only approach won't satisfy all cases either. To understand
why this is complex, let's see some historical issues:

IO versioning
* Elasticsearch. We delayed the move to version 6 until we heard of
more active users needing it (more deployments). We support 2.x and
5.x (but 2.x went recently EOL). Support for 6.x is in progress.
* SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
because most big data distributions still use 5.x (however 5.x has
been EOL).
* KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
most of the deployments of Kafka use earlier versions than 1.x. This
module uses a single version with the kafka client as a provided
dependency and so far it works (but we don’t have multi version
tests).

Runners versioning
* The move from Spark 1 to Spark 2 was decided after evaluating the
tradeoff between the burden of maintaining multiple version support and
the issues of introducing breaking changes. This is a rare case but also
one with consequences. This dependency is provided, but we don't actively
test issues on version migration.
* Flink moved to version 1.5, introducing an incompatibility in
checkpointing (discussed recently, with no consensus yet on how to
handle it).

As you can see, it seems really hard to have a solution that fits all
cases. Probably the only rule that I see from this list is that we
should upgrade versions for connectors that have been deprecated or
have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).

For the case of the provided dependencies, I wonder if, as part of the
tests, we should test with multiple versions (note that this
is currently blocked by BEAM-4087).

Any other ideas or opinions on how we can handle this? What do other
people in the community think? (Notice that this can be related to the
ongoing LTS discussion.)


On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
 wrote:
>
> Hi folks,
>
> I'd like to revisit the discussion around our versioning policy specifically 
> for the Hadoop ecosystem and make sure we are aware of the implications.
>
> As an example our policy today would have us on HBase 2.1 and I have 
> reminders to address this.
>
> However, currently the versions of HBase in the major hadoop distros are:
>
>  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume
> it is not widely adopted)
>  - AWS EMR HBase on 1.4
>
> On the versioning I think we might need a more nuanced approach to ensure 
> that we target real communities of existing and potential users. Enterprise 
> users need to stick to the supported versions in the distributions to 
> maintain support contracts from the vendors.
>
> Should our versioning policy leave more room to consider things on a
> case-by-case basis?
>
> For Hadoop might we benefit from a strategy on which community of users Beam 
> is targeting?
>
> (OT: I'm collecting some thoughts on what we might consider to target 
> enterprise hadoop users - kerberos on all relevant IO, performance, leaking 
> beyond encryption zones with temporary files etc)
>
> Thanks,
> Tim


Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Jean-Baptiste Onofré
Hi Tim,

regarding the IOs, a while ago (during the project's incubator time), we
discussed how to deal with different versions of the backend API and
dependencies. I proposed to have a release cycle per IO and a
subproject per IO version, like for instance:

sdks/java/io/elasticsearch-5
sdks/java/io/elasticsearch-6
...

I still think that's the best option, allowing us to really leverage the
backend version in the right way.

Regarding releases, it's like what I'm doing with the ServiceMix
Bundles: each IO can have its own release cycle. As we agreed on a
periodic release cycle, I'm not sure it's still required, but it could
be interesting (why not have a specific repository for IOs?).

Regards
JB

On 28/08/2018 10:43, Tim Robertson wrote:
> Hi folks,
> 
> I'd like to revisit the discussion around our versioning policy
> specifically for the Hadoop ecosystem and make sure we are aware of the
> implications.
> 
> As an example our policy today would have us on HBase 2.1 and I have
> reminders to address this.
> 
> However, currently the versions of HBase in the major hadoop distros are:
> 
>  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can
> assume it is not widely adopted)
>  - AWS EMR HBase on 1.4
> 
> On the versioning I think we might need a more nuanced approach to
> ensure that we target real communities of existing and potential users.
> Enterprise users need to stick to the supported versions in the
> distributions to maintain support contracts from the vendors.
> 
> Should our versioning policy leave more room to consider things on a
> case-by-case basis?
> 
> For Hadoop might we benefit from a strategy on which community of users
> Beam is targeting? 
> 
> (OT: I'm collecting some thoughts on what we might consider to target
> enterprise hadoop users - kerberos on all relevant IO, performance,
> leaking beyond encryption zones with temporary files etc)
> 
> Thanks,
> Tim


[DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Tim Robertson
Hi folks,

I'd like to revisit the discussion around our versioning policy
specifically for the Hadoop ecosystem and make sure we are aware of the
implications.

As an example our policy today would have us on HBase 2.1 and I have
reminders to address this.

However, currently the versions of HBase in the major hadoop distros are:

 - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
 - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume
it is not widely adopted)
 - AWS EMR HBase on 1.4

On the versioning I think we might need a more nuanced approach to ensure
that we target real communities of existing and potential users. Enterprise
users need to stick to the supported versions in the distributions to
maintain support contracts from the vendors.

Should our versioning policy leave more room to consider things on a
case-by-case basis?

For Hadoop might we benefit from a strategy on which community of users
Beam is targeting?

(OT: I'm collecting some thoughts on what we might consider to target
enterprise hadoop users - kerberos on all relevant IO, performance, leaking
beyond encryption zones with temporary files etc)

Thanks,
Tim