Any type should work. We can change it later.

On Thu, Oct 6, 2011 at 12:07 AM, John Conwell <[email protected]> wrote:
> Do you guys want it logged as a bug, feature, improvement? Does it matter?
>
> On Wed, Oct 5, 2011 at 1:32 PM, Andrei Savu <[email protected]> wrote:
>
>> I understand. From my point of view this is a bug we should fix. Can
>> you open an issue?
>>
>> On Wed, Oct 5, 2011 at 11:25 PM, John Conwell <[email protected]> wrote:
>>
>>> I thought about that, but the hadoop-site.xml created by whirr has
>>> some of the info needed; it's not the full set of xml elements that
>>> get written to the *-site.xml files on the hadoop cluster. For
>>> example, whirr sets *mapred.reduce.tasks* based on the number of
>>> task trackers, which is vital for the job configuration to have, but
>>> the hadoop-site.xml doesn't have this value. It only has the core
>>> properties needed to let you use the ssh proxy to interact with the
>>> name node and job tracker.
>>>
>>> On Wed, Oct 5, 2011 at 1:11 PM, Andrei Savu <[email protected]> wrote:
>>>
>>>> The files are also created on the local machine in
>>>> ~/.whirr/cluster-name/, so it shouldn't be that hard. The only
>>>> tricky part, from my point of view, is to match the Hadoop version.
>>>>
>>>> On Wed, Oct 5, 2011 at 11:01 PM, John Conwell <[email protected]> wrote:
>>>>
>>>>> This whole scenario does raise the question of how people handle
>>>>> this situation. To me the beauty of whirr is that I can spin up
>>>>> and tear down hadoop clusters on the fly when my workflow demands
>>>>> it. If a task gets queued up that needs mapreduce, I spin up a
>>>>> cluster, solve my problem, gather my data, kill the cluster, and
>>>>> the workflow goes on.
>>>>>
>>>>> But if my workflow requires the contents of three little files
>>>>> located on a different machine, in a different cluster, and
>>>>> possibly with a different cloud vendor, that really puts a damper
>>>>> on the whimsical on-the-flyness of creating hadoop resources only
>>>>> when needed. I'm curious how other people are handling this
>>>>> scenario.
>>>>>
>>>>> On Wed, Oct 5, 2011 at 12:45 PM, Andrei Savu <[email protected]> wrote:
>>>>>
>>>>>> Awesome! I'm glad we figured this out; I was getting worried that
>>>>>> we had a critical bug.
>>>>>>
>>>>>> On Wed, Oct 5, 2011 at 10:40 PM, John Conwell <[email protected]> wrote:
>>>>>>
>>>>>>> Ok... I think I figured it out. This email thread made me take a
>>>>>>> look at how I'm kicking off my hadoop job. My hadoop driver, the
>>>>>>> class that links a bunch of jobs together in a workflow, is on a
>>>>>>> different machine than the cluster that hadoop is running on.
>>>>>>> This means that when I create a new Configuration() object, it
>>>>>>> tries to load the default hadoop values from the classpath, but
>>>>>>> since the driver isn't running on the hadoop cluster and doesn't
>>>>>>> have access to the hadoop cluster's configuration files, it just
>>>>>>> uses the default values... a config that sucks.
>>>>>>>
>>>>>>> So I copied the *-site.xml files from my namenode over to the
>>>>>>> machine my hadoop job driver was running on, put them on the
>>>>>>> classpath, and shazam... it picked up the hadoop config that
>>>>>>> whirr created for me. Yay!
>>>>>>>
>>>>>>> On Wed, Oct 5, 2011 at 10:49 AM, Andrei Savu <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Wed, Oct 5, 2011 at 8:41 PM, John Conwell <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It looks like hadoop is reading default configuration values
>>>>>>>>> from somewhere and using them, and not reading from
>>>>>>>>> the /usr/lib/hadoop/conf/*-site.xml files.
>>>>>>>>
>>>>>>>> If you are running CDH the config files are in:
>>>>>>>>
>>>>>>>> HADOOP=hadoop-${HADOOP_VERSION:-0.20}
>>>>>>>> HADOOP_CONF_DIR=/etc/$HADOOP/conf.dist
>>>>>>>>
>>>>>>>> See
>>>>>>>> https://github.com/apache/whirr/blob/trunk/services/cdh/src/main/resources/functions/configure_cdh_hadoop.sh
>
> --
> Thanks,
> John C
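A minimal sketch in Java of the workaround discussed above: instead of
relying on the driver's classpath, load the site file whirr writes under
~/.whirr/cluster-name/ (or the *-site.xml copies pulled from the namenode)
into the Configuration explicitly. The DriverConfig class and the cluster
name are illustrative assumptions, not part of whirr or Hadoop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: build a Configuration for a remote whirr
    // cluster from the locally generated site file.
    public class DriverConfig {

        public static Configuration forCluster(String clusterName) {
            Configuration conf = new Configuration();
            String whirrDir = System.getProperty("user.home")
                    + "/.whirr/" + clusterName;
            // hadoop-site.xml carries the core properties (namenode,
            // job tracker, ssh proxy). Without it, new Configuration()
            // falls back to the built-in defaults, as happened here.
            conf.addResource(new Path(whirrDir + "/hadoop-site.xml"));
            return conf;
        }

        public static void main(String[] args) {
            Configuration conf = forCluster("myhadoopcluster");
            System.out.println("mapred.job.tracker = "
                    + conf.get("mapred.job.tracker"));
            // Cluster-side values such as mapred.reduce.tasks are only
            // present if the full *-site.xml files were also copied
            // over from the namenode, per John's fix.
            System.out.println("mapred.reduce.tasks = "
                    + conf.get("mapred.reduce.tasks", "<not set>"));
        }
    }

Any job built from a Configuration loaded this way should then submit to
the whirr-managed cluster rather than falling back to the local defaults.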
