I have a few cassandra/hadoop/pig questions.  I currently have things set up
in a test environment, and for the most part everything works.  But, before
I start to roll things out to production, I wanted to check on/confirm some
things.

When I originally set things up, I used:
http://wiki.apache.org/cassandra/HadoopSupport
http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html

One difference I noticed between the two guides, which I ignored at the
time, was how "datanodes" are treated.  The wiki said "At least one node in
your cluster will also need to be a datanode. That's because Hadoop uses
HDFS to store information like jar dependencies for your job, static data
(like stop words for a word count), and things like that - it's the
distributed cache. It's a very small amount of data but the Hadoop cluster
needs it to run properly". But the hadoop guide (if you follow it blindly,
like I did) has you run a datanode on every TaskTracker node. I _think_
that's controlled by the conf/slaves file, but I haven't proved it yet. Is
there any good reason to run a datanode only on the JobTracker vs. on all
nodes? And if I should only run it on the JobTracker, how do I properly
stop the datanodes from starting automatically everywhere (when both
start-dfs and start-mapred seem to draw from the same slaves file)?
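
For reference, here's what I'm planning to try in the meantime (a sketch,
assuming a stock 0.20.203 tarball layout; I haven't tested it yet): skip
start-dfs.sh entirely so the slaves file is never used to fan out
datanodes, and bring the HDFS daemons up by hand instead:

    # on the master: start the namenode directly instead of via
    # start-dfs.sh, so conf/slaves isn't consulted for datanodes
    bin/hadoop-daemon.sh start namenode

    # on the JobTracker box only: start the single datanode by hand
    bin/hadoop-daemon.sh start datanode

    # the TaskTrackers can still fan out over the full slaves file
    bin/start-mapred.sh

I also believe slaves.sh honors a HADOOP_SLAVES environment variable
pointing at an alternate hosts file, so keeping a separate dfs-slaves
file might work too, but that's an assumption I haven't verified.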

I noticed a second issue/oddness with datanodes: the HDFS data isn't
always small. The other day I ran out of disk running my pig script. I
checked, and by default hadoop puts HDFS data under /tmp, and I'm on EC2
(where /tmp is on the boot volume), which is only 10G by default. Do other
people put HDFS on a different disk? If yes, I'll really want to run only
one datanode, since I don't want to re-template all of my cassandra nodes
to have HDFS disks when I could just add one new JobTracker node instead.
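
If moving it is the answer, my understanding (an assumption on my part; I
haven't tried it yet) is that the /tmp default comes from hadoop.tmp.dir,
which dfs.data.dir inherits, so something like this in conf/core-site.xml
should relocate it (/mnt here is just the usual larger ephemeral volume
on EC2 instances):

    <!-- conf/core-site.xml: move everything hadoop writes off the 10G
         boot volume; /mnt is an assumption (the EC2 ephemeral disk) -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/mnt/hadoop-${user.name}</value>
    </property>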

In terms of hardware, I am running small instances (32-bit, 2GB) in the
test cluster, while my production cluster is larges (64-bit, 7 or 8GB). I
was going to check the performance impact there, but even on smalls in
test I was able to run hadoop jobs while serving web requests. I am
wondering if the smalls are causing the heavy /tmp usage, though (I think
data might "spill" to disk more when memory is tight, if I'm understanding
things correctly).
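
If I'm reading the spill mechanics right (again, an assumption), the
map-side sort buffer is io.sort.mb, spills land in mapred.local.dir
rather than in HDFS proper, and mapred.local.dir also defaults to a path
under hadoop.tmp.dir, so spills would be hitting the same /tmp volume.
These are the knobs I'd expect to tune in conf/mapred-site.xml, with
illustrative values only:

    <!-- conf/mapred-site.xml: illustrative values, not recommendations -->
    <property>
      <name>io.sort.mb</name>
      <!-- map-side sort buffer; 100MB is the default, and the small
           child heaps on smalls would force more frequent spills -->
      <value>100</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <!-- assumption: keep spill files off the 10G boot disk too -->
      <value>/mnt/hadoop/mapred/local</value>
    </property>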

If these are more hadoop than cassandra questions, let me know and I'll
move my questions over there.

I did want to mention that these are small details compared to the number
of complicated things that worked like a charm during my configuration
and testing of the cassandra/hadoop/pig combination. It was impressive :-)

Thanks!

will
