On 30 Mar 2015, at 13:27, jay vyas <jayunit100.apa...@gmail.com> wrote:


Just the same as Spark was disrupting the Hadoop ecosystem by changing the 
assumption that "you can't rely on memory in distributed analytics"... now maybe 
we are challenging the assumption that "big data analytics need to be distributed"?

I've been asking the same question lately and have similarly seen that Spark 
performs quite reliably and well on a local single-node system, even for a 
streaming app which I ran for ten days in a row...  I almost felt guilty that 
I never put it on a cluster....!
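For what it's worth, going single-node is literally just a change of master URL; 
here is a minimal sketch of that kind of long-lived local streaming job (the app 
name, batch interval, and socket source are all made up for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SingleNodeStreaming {
      def main(args: Array[String]): Unit = {
        // local[*] runs Spark in-process using every core on the box;
        // no cluster manager is involved at all
        val conf = new SparkConf()
          .setMaster("local[*]")
          .setAppName("single-node-streaming")
        val ssc = new StreamingContext(conf, Seconds(10))

        // hypothetical source: text lines arriving on a local socket
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }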

Modern machines can be pretty powerful: 16 physical cores HT'd to 32, 384+GB of 
RAM, a GPU, giving you lots of compute. What you don't get is the storage 
capacity to match, and especially the IO bandwidth. RAID-0 striping 2-4 HDDs 
gives you some boost, but if you are reading, say, a 4 GB file from HDFS broken 
into 256 MB blocks, that data lands as 48 block replicas (4 GB x 4 blocks/GB x 
3-way replication) spread across the cluster's disks. Algorithm and capacity 
permitting, you've just massively boosted your load bandwidth. Downstream, if 
the data can be thinned down, then you can start looking more at things you can 
do on a single host: a machine that can be in your Hadoop cluster. Ask YARN 
nicely and you can get a dedicated machine for a couple of days (i.e. until 
your Kerberos tokens expire).
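For the curious, "asking YARN nicely" can be as simple as requesting one 
executor sized to take most of a node. A sketch, assuming the 32-thread/384 GB 
box above, with purely illustrative sizes; the properties are the standard 
Spark-on-YARN settings, but whether the scheduler grants a container that big 
depends on limits like yarn.scheduler.maximum-allocation-mb on your cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    // one fat executor that soaks up (most of) a single machine;
    // leave a couple of cores and some RAM for the OS and the NodeManager
    val conf = new SparkConf()
      .setAppName("single-node-on-yarn")
      .setMaster("yarn-client")                // Spark 1.x YARN client mode
      .set("spark.executor.instances", "1")    // just one executor
      .set("spark.executor.cores", "30")
      .set("spark.executor.memory", "300g")
    val sc = new SparkContext(conf)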
