Re: [DISCUSSION]Pig releases with different versions of Hadoop

Russell Jurney Tue, 08 Nov 2011 08:17:42 -0800

Option 2 is consistent with 'Pigs eat anything.'

Russell Jurney
twitter.com/rjurney
[email protected]
datasyndrome.com


On Nov 8, 2011, at 8:05 AM, Alan Gates <[email protected]> wrote:

>
> On Nov 7, 2011, at 11:15 AM, Olga Natkovich wrote:
>
>> Hi,
>>
>> In the past we have for the most part avoided supporting multiple versions 
>> of Hadoop with the same version of Pig. This is about to change with release 
>> of Hadoop 23. We need to come up with a strategy on how to support that. 
>> There are a couple of issues to consider:
>>
>>
>> (1)    Version numbering. Seems like encoding the information in the last 
>> version number makes sense. The details of the encoding need to be hashed out.
>
> I can see two options.  One is to do major.minor.patch.hadoopversion, so for 
> example 0.10.1.h23 and 0.10.1.h20.  The problem I see with that is we *have* 
> to guarantee that they have the same functionality.  That is, 0.10.1 has all 
> the same patches regardless of which Hadoop version it is (excepting maybe 
> patches specific to a particular Hadoop version), the only difference is 
> which one it's compiled for.  Another problem is that this will proliferate 
> versions, cluttering up our website, confusing our users, and forcing the PMC 
> members into vote after vote.
>
> The second option would be to rework the pig package so that it had the jars 
> for both, and the pig shell script figures out based on the Hadoop it finds 
> which version is being used.  This has the nice feature of guaranteeing the 
> same features, but it has a few downsides.  One, it bloats our package (since 
> it's carrying multiple jars).  Two, what happens when someone wants to add 
> support for a new version (say Hadoop 22) to an existing release?  Three, now 
> a release manager must have access to all versions of Hadoop we claim to 
> cover, or wait for help from those who do, in order to test a release.
>
> Hive chose the second option, and dealt with the bloating issue by isolating 
> all the version specific code in one jar.
>
> We could deal with the concern of adding new versions to an existing release 
> by saying it's not allowed.  If you want to add a new supported version then 
> you create a new version.  This will devolve into a situation where versions 
> 0.10 and 0.12 work on Hadoop 20 and 23, but 0.11 works on 22.  That will be 
> horribly confusing for our users.
>
> I think the third issue, testability, will also mean that certain Pig 
> versions support only certain Hadoop versions without that being explicitly 
> marked.  Again, I think this is really bad.
>
> So I vote for the major.minor.patch.hadoopversion solution, though I think we 
> should work hard to make it clear to users how to select the right version of 
> Pig when downloading it.
>
>
>>
>> (2)    Code changes required to support different versions of Hadoop. This 
>> time around we made an effort to make sure that the same code can work with 
>> both. In the future that might not work and we would need to figure out how 
>> to maintain different code bases. Most likely we would have to have 
>> additional branches off the main release branch.
>
> Hopefully we can continue to do this via conditional compilation.  Having 
> different branches isn't maintainable.  How do I push a Hadoop version 
> specific patch to the next release?  We'll get an ever growing collection of 
> patches that have to be applied on a Hadoop specific branch for every 
> release.  We need to continue the rule that any patch must apply to the 
> trunk, even when it's version specific.
>
>>
>> (3)    Anything else we need to consider?
>>
>> Olga
>
> Alan.
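
For illustration only, the two options discussed above can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not how Pig or Hive actually does it: the names PigVersion, parse_pig_version, same_functionality and select_jar are hypothetical, the jar file names are made up, and the mapping from a Hadoop version string such as "0.23.0" to the tag "h23" is an assumption based on the 0.10.1.h23 example.

```python
import re
from typing import Mapping, NamedTuple


class PigVersion(NamedTuple):
    """Hypothetical major.minor.patch.hadoopversion release name (option 1)."""
    major: int
    minor: int
    patch: int
    hadoop: str  # e.g. "h20" or "h23"


_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.(h\d+)$")


def parse_pig_version(text: str) -> PigVersion:
    """Parse a string such as "0.10.1.h23" into its four components."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a major.minor.patch.hadoopversion string: {text!r}")
    major, minor, patch, hadoop = match.groups()
    return PigVersion(int(major), int(minor), int(patch), hadoop)


def same_functionality(a: PigVersion, b: PigVersion) -> bool:
    """Under option 1, releases that differ only in the Hadoop tag are
    expected to carry the same patches, so compare the first three fields."""
    return a[:3] == b[:3]


def select_jar(hadoop_version: str, jars: Mapping[str, str]) -> str:
    """Under option 2, pick the bundled jar that matches the Hadoop found.

    Assumption: the tag is "h" plus the second component of the Hadoop
    version string, so "0.23.0" maps to "h23".
    """
    parts = hadoop_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"unrecognized Hadoop version: {hadoop_version!r}")
    tag = "h" + parts[1]
    if tag not in jars:
        raise LookupError(f"no bundled jar for Hadoop tag {tag}")
    return jars[tag]


if __name__ == "__main__":
    h20 = parse_pig_version("0.10.1.h20")
    h23 = parse_pig_version("0.10.1.h23")
    print(same_functionality(h20, h23))
    # Hypothetical jar names, for illustration only.
    bundled = {"h20": "pig-h20.jar", "h23": "pig-h23.jar"}
    print(select_jar("0.23.0", bundled))
```

The LookupError branch corresponds to the concern raised about option 2: a package built for Hadoop 20 and 23 has nothing to offer a user who shows up with Hadoop 22.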
