On 11 Mar 2016, at 16:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Hi Steve,
>
> My argument has always been that if one is going to use Solid State Disks (SSDs), it makes sense to use them for the NN disks, to speed start-up from the fsimage etc. Obviously the NN lives in memory. Would you also recommend RAID10 (mirroring & striping) for the NN disks?

I don't have any suggestions there, sorry. That said, NN disks do need to be RAIDed for protection against corruption, as they don't have the cross-cluster replication. They matter. (Sketch 1 at the end of this mail shows a config-level complement.)

> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 11 March 2016 at 10:42, Steve Loughran <ste...@hortonworks.com> wrote:
>
>> On 10 Mar 2016, at 22:15, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
>>
>>> Hi,
>>>
>>> We intend to use 5 servers to build a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others).
>>
>> I'd argue that life is simpler with either of these, or Bigtop+Ambari built up yourself, for the management and monitoring tools more than anything else. Life is simpler if there's a web page of cluster status. But: DIY teaches you the internals of how things work, which is good for getting your hands dirty later on. Just start to automate things from the outset, keep configs under SCM, etc.
>>
>> And decide whether or not you want to go with Kerberos (== secure HDFS) from the outset; the switch itself is small (sketch 2 below), the surrounding work isn't. If you don't, put your cluster on a separate, isolated subnet. You ought to have the boxes on a separate switch anyway if you can, just to avoid network traffic hurting anyone else on the switch.
>>
>>> All servers have the same configuration: 512GB RAM, 30TB storage and 16 cores, running Ubuntu Linux. Hadoop will be installed on all the servers/nodes. Server 1 will be used for the NameNode plus a DataNode as well. Server 2 will be used for the standby NameNode & a DataNode. The rest of the servers will be used as DataNodes.
>>
>> 1. Make sure you've got the HDFS/NN space allocation on the two NNs set up so that HDFS blocks don't get in the way of the NN's needs, which ideally should be on a separate disk with RAID turned on (sketch 1 below).
>> 2. Same for the worker nodes; temp space matters (sketch 3 below).
>> 3. On a small cluster, the cost of a DN failure is more significant: a bigger fraction of the data goes offline, recovery bandwidth is limited to the 4 remaining boxes, etc. Just be aware of that: in a bigger cluster, losing a single server is usually less traumatic. Though HDFS-599 shows that even Facebook can lose a cluster or two.
>>
>>> Now we would like to install Spark on each server to create a Spark cluster. Is that a good thing to do, or should we buy additional hardware for Spark (minding cost here), or do we simply need additional memory to accommodate Spark as well? In that case, how much memory would you recommend for each Spark node?
>>
>> You should be running your compute work on the same systems as the data, as that's the "Hadoop cluster way"; locality of data ==> performance. If you were to buy more hardware, go for more store+compute, rather than just compute. (Sketch 4 below shows what the sizing knobs look like on YARN.)
>>
>> Spark likes RAM for sharing results; less RAM == more problems. But you can buy extra RAM if you need it, provided you've got space in the servers to put it in. Same for storage.
>>
>> Do make sure that you have ECC memory; there are some papers from Google and Microsoft on that topic if you want links to the details. Without ECC your data will be corrupted *and you won't even know* (sketch 5 below shows a quick way to check).
>>
>> -Steve
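Some config sketches for the points above. The property names are stock Hadoop/Spark; every path, host and number in them is a made-up placeholder, not a recommendation.

Sketch 1: NN metadata directories. Independent of hardware RAID, the NameNode mirrors its fsimage/edits into every directory listed in dfs.namenode.name.dir, so pointing it at two separate disks means losing one disk need not take the metadata with it; keeping dfs.datanode.data.dir on different disks stops HDFS blocks squeezing out the NN's space. A minimal hdfs-site.xml fragment, with hypothetical mount points:

  <!-- hdfs-site.xml -->
  <!-- NN metadata is written to every listed dir; put them on separate disks -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/mnt/nn1/namenode,/mnt/nn2/namenode</value>
  </property>
  <!-- DN block storage stays on its own disks, away from the NN dirs -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/data1/dfs,/mnt/data2/dfs</value>
  </property>

This complements rather than replaces RAID: it protects the metadata copies, not the disk hosting the OS and logs.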
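Sketch 2: the Kerberos switch. The on/off toggle itself is two properties in core-site.xml; the real work is standing up the KDC, principals and keytabs across every daemon, which is why it's worth deciding up front rather than retrofitting:

  <!-- core-site.xml: "simple" is the insecure default;
       "kerberos" turns on secure HDFS -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>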
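Sketch 3: keeping temp space safe on the workers. Two stock knobs help: reserving a slice of each data disk that HDFS may not fill, and spreading the NodeManager's temp/shuffle directories across disks. Values here are illustrative only:

  <!-- hdfs-site.xml: reserve ~50GB per volume for non-HDFS use (bytes) -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>53687091200</value>
  </property>

  <!-- yarn-site.xml: spread container temp/shuffle space across disks -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/data1/yarn/local,/mnt/data2/yarn/local</value>
  </property>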
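Sketch 4: sizing Spark on YARN. Running Spark on the same boxes just means sizing executors against what the NodeManagers offer, leaving headroom for the DN, NM and OS. The class and jar names are hypothetical and the numbers are only there to show which knobs exist, not what a 512GB/16-core box should run:

  # spark-submit against the existing YARN cluster
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 24g \
    --conf spark.yarn.executor.memoryOverhead=4096 \
    --class com.example.MyJob my-job.jar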
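Sketch 5: checking for ECC. On Linux the DIMMs report their error-correction type through the DMI tables; "None" in the output means no ECC. Needs root, and the exact wording varies by vendor:

  sudo dmidecode --type memory | grep -i 'error correction'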