Re: [DISCUSSION]Pig releases with different versions of Hadoop

Alan Gates Tue, 08 Nov 2011 08:05:32 -0800

On Nov 7, 2011, at 11:15 AM, Olga Natkovich wrote:

> Hi,
> 
> In the past we have for the most part avoided supporting multiple versions of 
> Hadoop with the same version of Pig. This is about to change with the release 
> of Hadoop 23. We need to come up with a strategy on how to support that. There 
> are a couple of issues to consider:
> 
> 
> (1)    Version numbering. Seems like encoding the information in the last 
> version number makes sense. The details of the encoding need to be hashed out


I can see two options.  One is to do major.minor.patch.hadoopversion, so for 
example 0.10.1.h23 and 0.10.1.h20.  The problem I see with that is we *have* to 
guarantee that they have the same functionality.  That is, 0.10.1 has all the 
same patches regardless of which Hadoop version it is (excepting maybe patches 
specific to a particular Hadoop version); the only difference is which one it's 
compiled for.  Another problem is that this will proliferate versions, 
cluttering up our website, confusing our users, and putting the PMC members 
through vote after vote.
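
To make the first scheme concrete, here is a minimal Python sketch of how such 
a version string could be split apart and how two builds could be checked for 
being the same release.  The exact format (an "h" followed by digits as the 
fourth component) and the names parse_version and same_release are assumptions 
for illustration only; nothing here is an agreed encoding.

```python
import re
from typing import NamedTuple


class PigVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    hadoop: str  # assumed form: "h" + digits, e.g. "h23"


# Assumed pattern for major.minor.patch.hadoopversion, e.g. "0.10.1.h23".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.(h\d+)$")


def parse_version(text: str) -> PigVersion:
    """Split a version string into its numeric parts and Hadoop tag."""
    m = _VERSION_RE.match(text)
    if m is None:
        raise ValueError(f"not a major.minor.patch.hadoopversion string: {text!r}")
    major, minor, patch, hadoop = m.groups()
    return PigVersion(int(major), int(minor), int(patch), hadoop)


def same_release(a: str, b: str) -> bool:
    """True when two builds differ at most in the Hadoop they target."""
    return parse_version(a)[:3] == parse_version(b)[:3]
```

Under this sketch, same_release("0.10.1.h23", "0.10.1.h20") holds, which is 
exactly the pair of builds that would have to carry the same patches.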

The second option would be to rework the pig package so that it has the jars 
for both, and the pig shell script figures out, based on the Hadoop it finds, 
which version is being used.  This has the nice property of guaranteeing the 
same features, but it has a few downsides.  One, it bloats our package (since 
it's carrying multiple jars).  Two, what happens when someone wants to add 
support for a new version (say Hadoop 22) to an existing release?  Three, a 
release manager must now have access to all versions of Hadoop we claim to 
cover, or wait for help from those who do, in order to test a release.
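
As a rough sketch of the selection step in this second option (written in 
Python rather than shell only to keep it short and testable), the launcher 
could look at the text printed by `hadoop version` and map it to a bundled 
jar.  The jar names and the functions hadoop_line and pick_jar are 
hypothetical; the real package layout is not settled anywhere in this thread.

```python
import re

# Hypothetical jar names, one per supported Hadoop line.
JARS = {
    "20": "pig-h20.jar",
    "23": "pig-h23.jar",
}


def hadoop_line(version_output: str) -> str:
    """Extract the Hadoop line ("20", "23", ...) from text like "Hadoop 0.20.2"."""
    m = re.search(r"Hadoop\s+0\.(\d+)", version_output)
    if m is None:
        raise ValueError("could not find a Hadoop 0.x version string")
    return m.group(1)


def pick_jar(version_output: str) -> str:
    """Return the bundled jar for the detected Hadoop, or fail loudly."""
    line = hadoop_line(version_output)
    try:
        return JARS[line]
    except KeyError:
        # This is downside two: the package has nothing for this Hadoop.
        raise LookupError(f"this package carries no jar for Hadoop {line}") from None
```

Feeding it "Hadoop 0.22.0" raises the LookupError, which is the situation a 
user would hit if the release was never built or tested against Hadoop 22.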

Hive chose the second option, and dealt with the bloating issue by isolating 
all the version-specific code in one jar.

We could deal with the concern of adding new versions to an existing release by 
saying it's not allowed.  If you want to add a newly supported Hadoop version 
then you create a new Pig release.  This will devolve into a situation where 
0.10 and 0.12 work on 20 and 23, but 0.11 works on 22.  That will be horribly 
confusing for our users.

I think the third issue, testability, will also mean that certain Pig versions 
in practice support only certain Hadoop versions, without that being explicitly 
marked.  Again, I think this is really bad.

So I vote for the major.minor.patch.hadoopversion solution, though I think we 
should work hard to make it clear to users how to select the right version of 
Pig when downloading it.


> 
> (2)    Code changes required to support different versions of Hadoop. This 
> time around we made an effort to make sure that the same code can work with 
> both. In the future that might not work and we would need to figure out how 
> to maintain different code bases. Most likely we would have to have additional 
> branches off of the main release branch

Hopefully we can continue to support both via conditional compilation.  Having 
different branches isn't maintainable.  How do I push a Hadoop-version-specific 
patch to the next release?  We'll get an ever-growing collection of patches 
that have to be applied on a Hadoop-specific branch for every release.  We need 
to continue the rule that any patch must apply to the trunk, even when it's 
version-specific.

> 
> (3)    Anything else we need to consider?
> 
> Olga

Alan.
