Re: Why are large code drops damaging to a community?

Joan Touzet Sat, 03 Nov 2018 12:38:45 -0700

Hi Myrle,

Thanks for starting this topic, and thanks to everyone else who has
shared their stories.


I'd like to bring up another scenario that is all-too-common in the
larger Apache projects: commercial "value-added" versions that find
their way back to the ASF after some time. I know this has happened for
CouchDB. I believe it's also happened for Hadoop, for OpenOffice, and
for other projects.

In the case of CouchDB, we did this precisely once, with Cloudant's (now
IBM's) bigcouch fork. CouchDB 1.x was a single-server, standalone
database solution. You could set up replication between multiple 1.x
installations and call it "a cluster," but no server had any special
understanding of the others. They were treated exactly the same as
someone replicating data to a laptop, to a phone running PouchDB, or to
some other third-party software that the server knew nothing about.

The Cloudant bigcouch fork (itself OSS) overlaid proper clustering
semantics on top of CouchDB. Now, 3 servers would work in tandem to
store and index data, and the key principles of distributed computing
were adhered to. There was considerable interest in bigcouch in the
CouchDB community, with the main question always being: when will this
get merged back into CouchDB itself? It was a major reworking of the
internals of the project, it couldn't be done piecemeal, and it
would certainly result in a version 2.0 if it happened.

On top of this, Cloudant then forked CouchDB again, applied the changes
in their bigcouch fork, then applied proprietary changes for their own
web service offering. This version was known internally at Cloudant as
"dbcore." Most of the changes were specific to running a business on
hosted CouchDB, but once dbcore started, bigcouch was effectively
abandoned, with bugfixes and patches landing only in the private dbcore
repository.

Internal to Cloudant (disclaimer: I was working there at the time, I am
no longer an employee), there were repeated calls for most of dbcore to
be merged back into Apache CouchDB. There was always interest and
desire, but business imperatives kept getting in the way. We knew it
was a major undertaking.

In 2012, shortly after IBM acquired Cloudant, I led the dev team in the
"big merge" effort to get the code that was deemed acceptable to OSS
back into CouchDB. There were multiple public blog posts, public pull
requests (on GitHub) showing the progress of the merge, and various
awareness campaigns on the mailing list.  It was not a trivial task; it
involved flying one key developer to spend a week with another key
developer to go through the change history commit by commit. (We jokingly
referred to this as the "Nebraska Merge.")

All of this was done with the express approval of the project, its PMC,
and the large majority of the contributors to the project. I wouldn't
have agreed to lead the charge if that wasn't the case. CouchDB 2.0.0
was released with the new code in September 2016.

That's not to say it wasn't disruptive to the community.  At least one
pending set of changes, adding a new feature (that had also been
developed on someone's private fork!), had to be discarded because it was
not compatible with distributed computing considerations post-merge.
Similarly, many new requests for changes or features have had to be sent
back for redesign because they assumed a single-server model. However,
the flipside is also true: we were able to ditch our older web-based UI
(which was un-maintained for many years) and replace it with a modern
HTML5 interface that has brought many, *many* more JavaScript developers
into the community with its ease of development and maintenance.

One key point is that IBM/Cloudant changed their development model as it
related to CouchDB + proprietary features rapidly after this release.
They moved any development that required changes to the OSS code
directly to the main Apache repository, so that no massive code dumps
would happen in the future. In fact, their continued releases of new
OSS-available functionality - like improved indexing, user-partitioned
databases, clustered purge, and the vastly improved UI - all happen out
in the open, through the same ASF process by which we accept changes
from smaller companies and individual contributors. IBM/Cloudant has
done an outstanding job in this, and we're very happy that they do so.
Without this sort of goodwill and cooperation, as a PMC member, I'd be
nervous about another massive code drop disrupting the CouchDB community
again.

Overall, I'd say the community is stronger post-merge than pre-merge,
but the actual merge itself was very disruptive. I'm also glad it
happened. In a way, it feels a bit like having had bypass surgery, I
guess :)

-Joan


----- Original Message -----
> From: "Myrle Krantz" <my...@apache.org>
> To: dev@community.apache.org, "dev" <d...@fineract.apache.org>
> Sent: Thursday, October 18, 2018 7:18:07 AM
> Subject: Why are large code drops damaging to a community?
> 
> Hey all,
> 
> There are many forms of offlist development.  One form of offlist
> development is working on large code drops in private and then
> contributing them all at once.  Threshold size is probably arguable,
> and varies by project; put that aside for the moment.  I've been
> working on an explanation of how large code drops damage community
> and code.  I'd love to hear your feedback.  I'm including my project and
> the dev@community list in the hopes that people from other projects
> also have a perspective.  Here it goes:
> 
> 
> Imagine you are an individual contributor on a project.  You would
> like to contribute something.  You see a feature you'd like to add or
> a bug you'd like to fix, a user you would like to support, or a
> release you'd like to test.  You start on it.  You submit your pull
> request, you answer the user's question, you test the release.  You
> continue doing this at a low level for a few months.  You see other
> people starting to contribute too.  This is nice.  You're working
> together with others towards a common goal.  Then, out of the blue a
> company with multiple paid contributors shows up.  Let's name them
> Acme. Acme drops a year of code on the project.  They could do this
> many ways.  For example:  A.) Acme could develop in the repository
> you were working in, or B.) Acme could create a project-internal fork and
> create a new repository. C.) Acme could even telegraph months in
> advance that they intend to do this, by posting to the dev list or
> contacting key contributors offlist, or just by having done it a few
> times already.
> 
> 
> A.) First let's imagine that Acme made massive changes in the
> repository you were working in.  Perhaps they already solved the
> problem you solved, but in a different way.  Perhaps, they deleted
> functions you made changes in.  Perhaps they added significant
> functionality you would have liked to help with.  What good were your
> efforts?  Wouldn't you find this discouraging?
> 
> And now you want to continue to make changes, but the code you want
> to change has commit messages referencing tickets which you have no
> access to.  Or it has no reference to tickets at all.  You find an
> area that seems to be needlessly complex: can you remove the
> complexity?  You have no way of knowing what you'd be breaking.
> 
> Perhaps you have a proprietary UI which depends on a behavior which
> was removed or changed.  Now your UI is broken.  Because the code
> drop is so large, you have no way to reasonably review it for
> incompatibilities.  It is not possible to review a year of
> development all at once.  And if your review turns up problems?  Do
> you accept the entire pull request or decline the whole thing?
> Putting all the code into one pull request is a form of blackmail
> (commonly used in the formulation of bills for Congress).  If you
> want the good you have to take the bad.
> 
> 
> B.) Now let's imagine that Acme forked the code and created a new
> repository which they then added to the project.  None of the work
> you did is in this new repository.  If those features you
> implemented were important to you, you will have to re-introduce
> them into the new repository.
> 
> You'll have to start from zero learning to work in the new
> repository.  You also had no say in how that code was developed, so
> maybe the feature that you need is unnecessarily difficult to
> implement in that repository.  You don't know why things are the way
> they are there, so you're walking through a mine field without a map
> when you're making changes.
> 
> And anyways, why is Acme Corp so certain you had nothing of value to
> add?
> 
> Releasing this code also becomes contentious. Which of the two
> competing repositories gets released?  Both of them? How does the
> project communicate to users about how these pieces fit together?
> 
> 
> C.) Imagine Acme gave you lots of forewarning that this was coming.
> You still have no say in how the code is developed.  You know that
> anything you might contribute could be obsoleted.  You can't tell
> users whether the up-and-coming release will be compatible.  And
> what's the point in testing that release?  You don't know how to
> check that your needs are being considered in the architecture of
> the new code base.
> 
> You have no sense of ownership over what comes out of that process.
> 
> You see that nobody else outside of Acme is working on the project
> either, for the same reasons.
> 
> 
> Most contributors would get discouraged and prefer not to participate
> if those were the conditions.  If contributors didn't get
> discouraged, they would fairly quickly be taking orders from the
> employees of Acme Corp.  Acme Corp has all the inside information
> about what's coming in a year in the next code dump.  Information is
> power.  Contributors who are also users may also choose to stop
> contributing and become free riders.  Why not just depend on Acme
> Corp for all of the development?
> 
> What Acme seems to be getting out of this scenario is an Apache
> feather.  It's a form of free-riding on Apache's reputation.
> 
> 
> Now let's imagine that you are the CTO of another company, let's call
> them Kaushal.  Kaushal is considering taking part in this open source
> project, but they are a competitor to Acme.  As Kaushal's CTO, you
> can see, based on commit history and participation, that Acme is
> dominating the project.  You would be smart to expect that Acme would
> take advantage of their dominant position in the project.  Acme could
> deliberately sabotage Kaushal's use cases, or simply 'starve' them by
> convincing people not to help Kaushal.  Kaushal's CTO could respond
> to this threat in one of two ways: 1.) Simply not take part in the
> open source project.  Create their own closed source thing, or their
> own open source project, and not engage.  This is the most likely
> response.  2.) Try to dominate the project themselves.  Kaushal has
> the same tools available that Acme has.  Kaushal's CTO could tell his
> employees to do long-interval code drops just like Acme is doing.
> Now, with two corporations doing long-interval code drops on a
> project, merging the code becomes very very difficult.  Fights about
> who gets to decide what could eventually cause a complete cessation
> of release activity.
> 
> 
> So imagine that all competitors choose to remain just users, and Acme
> remains in control.  Now imagine Acme loses interest in the project.
> Acme found something that will make them more money, or Acme's
> business fails.  Or Acme gets tired of offering their development
> resources to the free riders.  Acme stops contributing to the
> project.  But the project has become so dependent on Acme that it
> cannot exist without Acme.  When Acme exits, project activity
> could end.
> 
> 
> Open source projects require transparency, not just as a moral value,
> but as a pragmatic prerequisite for collaboration.  Offlist
> development damages the community *and* the code.
> 
> Best Regards,
> Myrle
> 
> P.S.  Some very interesting research on the game-theoretical aspects
> of modularization in open source:
> http://people.hbs.edu/cbaldwin/DR2/BaldwinClark.ArchOS.Jun03.pdf
> "Does Code Architecture Mitigate Free Riding in the Open Source
> Development Model?"
> 
> I would argue that the information divisibility being applied here at
> the code modularity dimension also applies to the time dimension.
> So, it seems likely the argument against large code drops can be made
> mathematically. Which really tickles the geek in me. : o)
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@community.apache.org
> For additional commands, e-mail: dev-h...@community.apache.org
> 
> 

