This whole scenario does raise the question of how people handle this kind of situation. To me the beauty of Whirr is that I can spin Hadoop clusters up and down on the fly when my workflow demands it. If a task gets queued up that needs MapReduce, I spin up a cluster, solve my problem, gather my data, kill the cluster, and the workflow goes on.
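Concretely, that lifecycle is only a few lines against Whirr's Java API. A rough sketch from memory against the 0.6.x classes (the "hadoop.properties" file name is a placeholder for whatever cluster definition you use):

    import org.apache.commons.configuration.PropertiesConfiguration;
    import org.apache.whirr.Cluster;
    import org.apache.whirr.ClusterController;
    import org.apache.whirr.ClusterSpec;

    // The properties file holds the cluster definition:
    // whirr.cluster-name, whirr.instance-templates, cloud credentials, etc.
    ClusterSpec spec = new ClusterSpec(
        new PropertiesConfiguration("hadoop.properties"));
    ClusterController controller = new ClusterController();

    Cluster cluster = controller.launchCluster(spec);  // spin up
    try {
        // ... submit the MapReduce jobs, gather the output ...
    } finally {
        controller.destroyCluster(spec);               // kill the cluster
    }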
But if my workflow requires the contents of three little files located on a different machine, in a different cluster, and possibly a different cloud vendor, that really puts a damper on the whimsical on-the-flyness of creating Hadoop resources only when needed. I'm curious how other people are handling this scenario.

On Wed, Oct 5, 2011 at 12:45 PM, Andrei Savu <[email protected]> wrote:

> Awesome! I'm glad we figured this out, I was getting worried that we had a
> critical bug.
>
> On Wed, Oct 5, 2011 at 10:40 PM, John Conwell <[email protected]> wrote:
>
>> Ok...I think I figured it out. This email thread made me take a look at
>> how I'm kicking off my Hadoop job. My Hadoop driver, the class that links a
>> bunch of jobs together in a workflow, is on a different machine than the
>> cluster that Hadoop is running on. This means when I create a new
>> Configuration() object, it tries to load the default Hadoop values from
>> the classpath, but since the driver isn't running on the Hadoop cluster and
>> doesn't have access to the Hadoop cluster's configuration files, it just uses
>> the default values...config for suck.
>>
>> So I copied the *-site.xml files from my namenode over to the machine my
>> Hadoop job driver was running from and put them on the classpath, and
>> shazam...it picked up the Hadoop config that Whirr created for me. Yay!
>>
>> On Wed, Oct 5, 2011 at 10:49 AM, Andrei Savu <[email protected]> wrote:
>>
>>> On Wed, Oct 5, 2011 at 8:41 PM, John Conwell <[email protected]> wrote:
>>>
>>>> It looks like Hadoop is reading default configuration values from
>>>> somewhere and using them, and not reading from
>>>> the /usr/lib/hadoop/conf/*-site.xml files.
>>>
>>> If you are running CDH the config files are in:
>>>
>>> HADOOP=hadoop-${HADOOP_VERSION:-0.20}
>>> HADOOP_CONF_DIR=/etc/$HADOOP/conf.dist
>>>
>>> See
>>> https://github.com/apache/whirr/blob/trunk/services/cdh/src/main/resources/functions/configure_cdh_hadoop.sh
>>
>> --
>> Thanks,
>> John C

--
Thanks,
John C
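P.S. For anyone who hits the same thing, the driver-side fix boils down to making the cluster's config visible to new Configuration(). A minimal sketch (the /path/to/cluster-conf paths are placeholders for wherever you copied the *-site.xml files; Whirr also drops a generated hadoop-site.xml under ~/.whirr/<cluster-name>/ that can be loaded the same way):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // new Configuration() only reads the *-site.xml files it can find on
    // the driver's classpath. A driver running off-cluster silently falls
    // back to the defaults (fs.default.name=file:///, local job runner).
    Configuration conf = new Configuration();

    // Either put the copied files on the classpath, or add them explicitly:
    conf.addResource(new Path("/path/to/cluster-conf/core-site.xml"));
    conf.addResource(new Path("/path/to/cluster-conf/hdfs-site.xml"));
    conf.addResource(new Path("/path/to/cluster-conf/mapred-site.xml"));

Jobs built from that conf then talk to the remote NameNode and JobTracker instead of the local defaults.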
