On 11 Mar 2016, at 16:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Hi Steve,
>
> My argument has always been that if one is going to use Solid State Disks (SSDs), it makes sense to use them for the NN disks, to speed start-up from the fsimage etc. Obviously the NN lives in memory. Would you also recommend RAID10 (mirroring & striping) for the NN disks?

I don't have any suggestions there, sorry. That said, NN disks do need to be RAIDed for protection against corruption, as they don't have the cross-cluster replication. They matter. (Sketch 1 at the end of this mail shows a config-level complement.)

> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 11 March 2016 at 10:42, Steve Loughran <ste...@hortonworks.com> wrote:
>
>> On 10 Mar 2016, at 22:15, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
>>
>>> Hi,
>>>
>>> We intend to use 5 servers to build a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others).
>>
>> I'd argue that life is simpler with either of these, or Bigtop+Ambari built up yourself, for the management and monitoring tools more than anything else. Life is simpler if there's a web page of cluster status. But: DIY teaches you the internals of how things work, which is good for getting your hands dirty later on. Just start to automate things from the outset, keep configs under SCM, etc.
>>
>> And decide whether or not you want to go with Kerberos (== secure HDFS) from the outset; the switch itself is small (sketch 2 below), the surrounding work isn't. If you don't, put your cluster on a separate, isolated subnet. You ought to have the boxes on a separate switch anyway if you can, just to avoid network traffic hurting anyone else on the switch.
>>
>>> All servers have the same configuration: 512GB RAM, 30TB storage and 16 cores, running Ubuntu Linux. Hadoop will be installed on all the servers/nodes. Server 1 will be used for the NameNode plus a DataNode as well. Server 2 will be used for the standby NameNode & a DataNode. The rest of the servers will be used as DataNodes.
>>
>> 1. Make sure you've got the HDFS/NN space allocation on the two NNs set up so that HDFS blocks don't get in the way of the NN's needs, which ideally should be on a separate disk with RAID turned on (sketch 1 below).
>> 2. Same for the worker nodes; temp space matters (sketch 3 below).
>> 3. On a small cluster, the cost of a DN failure is more significant: a bigger fraction of the data goes offline, recovery bandwidth is limited to the 4 remaining boxes, etc. Just be aware of that: in a bigger cluster, losing a single server is usually less traumatic. Though HDFS-599 shows that even Facebook can lose a cluster or two.
>>
>>> Now we would like to install Spark on each server to create a Spark cluster. Is that a good thing to do, or should we buy additional hardware for Spark (minding cost here), or do we simply need additional memory to accommodate Spark as well? In that case, how much memory would you recommend for each Spark node?
>>
>> You should be running your compute work on the same systems as the data, as that's the "Hadoop cluster way"; locality of data ==> performance. If you were to buy more hardware, go for more store+compute, rather than just compute. (Sketch 4 below shows what the sizing knobs look like on YARN.)
>>
>> Spark likes RAM for sharing results; less RAM == more problems. But you can buy extra RAM if you need it, provided you've got space in the servers to put it in. Same for storage.
>>
>> Do make sure that you have ECC memory; there are some papers from Google and Microsoft on that topic if you want links to the details. Without ECC your data will be corrupted *and you won't even know* (sketch 5 below shows a quick way to check).
>>
>> -Steve
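Some config sketches for the points above. The property names are stock Hadoop/Spark; every path, host and number in them is a made-up placeholder, not a recommendation.

Sketch 1: NN metadata directories. Independent of hardware RAID, the NameNode mirrors its fsimage/edits into every directory listed in dfs.namenode.name.dir, so pointing it at two separate disks means losing one disk need not take the metadata with it; keeping dfs.datanode.data.dir on different disks stops HDFS blocks squeezing out the NN's space. A minimal hdfs-site.xml fragment, with hypothetical mount points:

  <!-- hdfs-site.xml -->
  <!-- NN metadata is written to every listed dir; put them on separate disks -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/mnt/nn1/namenode,/mnt/nn2/namenode</value>
  </property>
  <!-- DN block storage stays on its own disks, away from the NN dirs -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/data1/dfs,/mnt/data2/dfs</value>
  </property>

This complements rather than replaces RAID: it protects the metadata copies, not the disk hosting the OS and logs.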
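Sketch 2: the Kerberos switch. The on/off toggle itself is two properties in core-site.xml; the real work is standing up the KDC, principals and keytabs across every daemon, which is why it's worth deciding up front rather than retrofitting:

  <!-- core-site.xml: "simple" is the insecure default;
       "kerberos" turns on secure HDFS -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>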
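Sketch 3: keeping temp space safe on the workers. Two stock knobs help: reserving a slice of each data disk that HDFS may not fill, and spreading the NodeManager's temp/shuffle directories across disks. Values here are illustrative only:

  <!-- hdfs-site.xml: reserve ~50GB per volume for non-HDFS use (bytes) -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>53687091200</value>
  </property>

  <!-- yarn-site.xml: spread container temp/shuffle space across disks -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/data1/yarn/local,/mnt/data2/yarn/local</value>
  </property>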
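Sketch 4: sizing Spark on YARN. Running Spark on the same boxes just means sizing executors against what the NodeManagers offer, leaving headroom for the DN, NM and OS. The class and jar names are hypothetical and the numbers are only there to show which knobs exist, not what a 512GB/16-core box should run:

  # spark-submit against the existing YARN cluster
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 24g \
    --conf spark.yarn.executor.memoryOverhead=4096 \
    --class com.example.MyJob my-job.jar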
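Sketch 5: checking for ECC. On Linux the DIMMs report their error-correction type through the DMI tables; "None" in the output means no ECC. Needs root, and the exact wording varies by vendor:

  sudo dmidecode --type memory | grep -i 'error correction'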