On 30 Mar 2015, at 13:27, jay vyas <jayunit100.apa...@gmail.com> wrote:
> Just the same as Spark disrupted the Hadoop ecosystem by challenging the assumption that "you can't rely on memory in distributed analytics", maybe now we are challenging the assumption that "big data analytics need to be distributed"? I've been asking the same question lately, and I've similarly seen Spark perform quite reliably and well on a local single-node system, even for a streaming app which I ran for ten days in a row... I almost felt guilty that I never put it on a cluster....!

Modern machines can be pretty powerful: 16 physical cores hyperthreaded to 32, 384+ GB of RAM, a GPU: that gives you lots of compute. What you don't get is the storage capacity to match, and especially the IO bandwidth. RAID-0 striping 2-4 HDDs gives you some boost, but if you are reading, say, a 4 GB file from HDFS broken into 256 MB blocks, that data is spread over 48 block replicas (16 blocks * 3-way replication). Algorithm and capacity permitting, you've just massively boosted your load bandwidth. Downstream, if the data can be thinned down, then you can start looking more at things you can do on a single host: a machine that can be in your Hadoop cluster. Ask YARN nicely and you can get a dedicated machine for a couple of days (i.e. until your Kerberos tokens expire).
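The block arithmetic above can be sketched quickly; a minimal illustration, assuming the file and block sizes from the mail and HDFS's default 3-way replication:

```python
# Sketch of the HDFS block arithmetic from the mail above.
# Figures: 4 GB file, 256 MB blocks; replication factor 3 is the HDFS default.
FILE_SIZE_MB = 4 * 1024   # 4 GB file
BLOCK_SIZE_MB = 256       # HDFS block size
REPLICATION = 3           # HDFS default replication factor

# Number of blocks the file is split into (round up for any partial block).
blocks = -(-FILE_SIZE_MB // BLOCK_SIZE_MB)

# Total block replicas stored cluster-wide -- each one a potential
# local read for some node, which is where the load-bandwidth boost comes from.
replicas = blocks * REPLICATION

print(f"{blocks} blocks, {replicas} replicas available for parallel reads")
# -> 16 blocks, 48 replicas available for parallel reads
```

A single host striping 2-4 local disks can't match that many spindles serving blocks in parallel, which is the bandwidth argument being made.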