Hi Steve,

My argument has always been that if one is going to use solid state disks (SSDs), it makes sense to use them for the NameNode disks, for start-up from the fsimage etc. Obviously the NN lives in memory. Would you also recommend RAID10 (mirroring & striping) for the NN disks?
Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 11 March 2016 at 10:42, Steve Loughran <ste...@hortonworks.com> wrote:

>> On 10 Mar 2016, at 22:15, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
>>
>> Hi,
>>
>> We intend to use 5 servers which will be utilized for building a Bigdata
>> Hadoop data warehouse system (not using any proprietary distribution like
>> Hortonworks or Cloudera or others).
>
> I'd argue that life is simpler with either of these, or bigtop+ambari
> built up yourself, for the management and monitoring tools more than
> anything else. Life is simpler if there's a web page of cluster status.
> But: DIY teaches you the internals of how things work, which is good for
> getting your hands dirty later on. Just start to automate things from the
> outset, keep configs under SCM, etc. And decide whether or not you want to
> go with Kerberos (==secure HDFS) from the outset. If you don't, put your
> cluster on a separate isolated subnet. You ought to have the boxes on a
> separate switch anyway if you can, just to avoid network traffic hurting
> anyone else on the switch.
>
>> All servers' configurations are 512GB RAM, 30TB storage and 16 cores,
>> Ubuntu Linux servers. Hadoop will be installed on all the servers/nodes.
>> Server 1 will be used for NameNode plus DataNode as well. Server 2 will be
>> used for standby NameNode & DataNode. The rest of the servers will be used
>> as DataNodes.
>
> 1. Make sure you've got the HDFS/NN space allocation on the two NNs set up
> so that HDFS blocks don't get in the way of the NN's needs (which ideally
> should be on a separate disk with RAID turned on);
> 2. Same for the worker nodes; temp space matters.
> 3. On a small cluster, the cost of a DN failure is more significant: a
> bigger fraction of the data will go offline, recovery bandwidth is limited
> to the 4 remaining boxes, etc. Just be aware of that: in a bigger cluster,
> losing a single server is usually less traumatic. Though HDFS-599 shows
> that even Facebook can lose a cluster or two.
>
>> Now we would like to install Spark on each server to create a Spark
>> cluster. Is that a good thing to do, or should we buy additional hardware
>> for Spark (minding cost here), or do we simply require additional memory
>> to accommodate Spark as well please? In that case how much memory for
>> each Spark node would you recommend?
>
> You should be running your compute work on the same systems as the data,
> as it's the "hadoop cluster way"; locality of data ==> performance. If you
> were to buy more hardware, go for more store+compute, rather than just
> compute.
>
> Spark likes RAM for sharing results; less RAM == more problems. But: you
> can buy extra RAM if you need it, provided you've got space in the servers
> to put it in. Same for storage.
>
> Do make sure that you have ECC memory; there are some papers from Google
> and Microsoft on that topic if you want links to the details. Without ECC
> your data will be corrupted *and you won't even know*.
>
> -Steve
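
As a back-of-the-envelope illustration of Steve's point 3 (the cost of losing one DataNode in a 5-node cluster), here is a small Python sketch. The per-node usage and network figures are my own illustrative assumptions, not numbers from the thread:

    # Rough cost of losing one DataNode in a small cluster.
    # All figures below are illustrative assumptions, not measurements.

    total_datanodes  = 5        # cluster size described in the thread
    used_per_node_tb = 20.0     # assumed HDFS usage per node, in TB
    nic_gbit         = 10       # assumed NIC speed per node, Gbit/s

    # Fraction of the cluster's block replicas that go offline with one DN:
    offline_fraction = 1 / total_datanodes              # 20%

    # Data to re-replicate, and the boxes left to do it:
    to_rereplicate_tb = used_per_node_tb
    survivors = total_datanodes - 1                      # 4 boxes

    # Very rough time to restore replication if each survivor can spare
    # roughly half of its NIC for recovery traffic:
    recovery_gbit = survivors * nic_gbit * 0.5
    hours = (to_rereplicate_tb * 8 * 1000) / (recovery_gbit * 3600)

    print(f"{offline_fraction:.0%} of replicas offline, "
          f"~{hours:.1f} hours to re-replicate {to_rereplicate_tb} TB "
          f"over {survivors} surviving nodes")

With these assumed figures a single DN failure takes roughly 20% of the replicas offline and keeps the four survivors busy re-replicating for a couple of hours; on a 50-node cluster the same failure is a far smaller event.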
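And a minimal PySpark sketch of the per-node sizing question (how much memory per Spark node when executors share the 512GB/16-core boxes with the HDFS daemons). The master URL, application name and every figure here are hypothetical starting points to be tuned, not recommendations from the thread:

    from pyspark import SparkConf, SparkContext

    # Hypothetical sizing for a 512 GB / 16-core node that also hosts HDFS
    # daemons: leave cores and memory free for the DataNode, the OS page
    # cache, and (on two boxes) the NameNode. All values are assumptions.
    conf = (SparkConf()
            .setAppName("warehouse-etl")           # hypothetical job name
            .setMaster("spark://master:7077")      # hypothetical standalone master URL
            .set("spark.executor.cores", "4")      # assumed cores per executor
            .set("spark.executor.memory", "96g")   # assumed executor heap size
            .set("spark.memory.fraction", "0.6"))  # heap split for execution + storage

    sc = SparkContext(conf=conf)
    print(sc.getConf().toDebugString())
    sc.stop()

The intent of the numbers is simply to show the trade-off Steve describes: leave a healthy slice of each box's RAM and cores for HDFS and the OS, and give the rest to Spark rather than buying compute-only hardware.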