Okies, thank you D, I will start playing around with the Sandbox version.
On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <[email protected]> wrote:

> The Sandbox is just meant to be a learning environment, I guess: a way to
> see what's possible and how things can be connected. A real distribution
> will have much higher performance and is the one you need when you want to
> investigate performance issues. The only real drawback of the real
> distributions is that they take more time to set up when you just want to
> play around.
>
>
> 2014-03-12 21:23 GMT+01:00 [email protected] <[email protected]>:
>
>> Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one of the 3 big vendors - cloudera, mapR and
>> hortonworks"
>>
>> I want to understand how this virtual setup would work, how many master
>> and slave nodes I can have in it, and in general what the differences are
>> between an actual Hadoop distribution and these ready-made virtual setups.
>>
>> Regards, Andy.
>>
>>
>> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> 1) HDFS is just a file system; it hides the fact that it is distributed.
>>> 2) MapReduce is the most low-level analytics tool, I think: you just
>>> specify an input, and in your map and reduce functions you define how to
>>> deal with this input. No need for HBase etc., although they can be
>>> extremely useful.
>>> 3) This is all in the Hadoop reference: first the namenode finds a place
>>> to allocate your data, then it gets copied to the corresponding datanode 1,
>>> and from datanode 1 it is copied to datanode 2 (note the numbers have no
>>> special meaning).
>>> 4) Your data will be on both datanodes. Why would that be a problem?
>>> 5) For a proof of concept I would use a ready-made virtual machine from
>>> one of the three big vendors: Cloudera, MapR or Hortonworks.
>>> 6) The Apache version is more basic; the commercial distributions have
>>> more built-in features and are easier to work with, I guess.
>>> 7) You have to install them separately; maybe that is the main reason to
>>> go for one of the vendors?
>>>
>>> You should definitely have a look at the reference. You don't have to
>>> read it from A to Z, but it contains sections where every single sentence
>>> will answer one of your questions.
>>>
>>> Regards, D
>>>
>>>
>>> 2014-03-12 20:37 GMT+01:00 [email protected] <[email protected]>:
>>>
>>>> Thank you Shahab, but it would be really nice if I could get some input
>>>> on my initial question, as it would really help.
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <[email protected]> wrote:
>>>>
>>>>> I would suggest that, given the level of detail you are looking for
>>>>> and the fundamental nature of your questions, you should get hold of
>>>>> books or online documentation. Basically, some reading/research.
>>>>>
>>>>> The latest edition of
>>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
>>>>> is highly recommended to begin with.
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Wed, Mar 12, 2014 at 3:07 PM, [email protected] <[email protected]> wrote:
>>>>>
>>>>>> Hello Team,
>>>>>>
>>>>>> I am starting off on the Hadoop ecosystem and wanted to learn first,
>>>>>> based on my use case, whether Hadoop is the right tool for me.
>>>>>>
>>>>>> I have only structured data, and my goal is to save this data into
>>>>>> Hadoop and take advantage of the replication factor. I am using
>>>>>> Microsoft tools for analysis; they provide good drag-and-drop
>>>>>> functionality for creating different kinds of analysis, and they also
>>>>>> have Hadoop drivers, so they can use Hadoop as a data source.
>>>>>>
>>>>>> My question here is what benefits the YARN architecture gives me in
>>>>>> terms of analysis that my Microsoft, Netezza or Tableau products are
>>>>>> not giving me. I am just trying to understand the value of introducing
>>>>>> Hadoop into my architecture in terms of analysis, apart from data
>>>>>> replication. Any insights would be very helpful.
>>>>>>
>>>>>> Also, my goal for the POC is related to efficient data
>>>>>> storage/retrieval, and so:
>>>>>>
>>>>>>    1. How does data retrieval work in Hadoop?
>>>>>>    2. Do I always need some kind of data store on top of HDFS, like
>>>>>>    HBase/Cassandra/Mongo, or is there no need for one, so that I can
>>>>>>    store all my data in HDFS directly and retrieve it when I need it
>>>>>>    using different analytic tools that have HDFS as a data source?
>>>>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I am
>>>>>>    trying to insert data into Hadoop: what is the cycle the framework
>>>>>>    performs to store my data in HDFS? Does my process read all the
>>>>>>    metadata from the master node about where my slave nodes are and
>>>>>>    what data should go to which slave node, or is all data sent to the
>>>>>>    master node, which then decides from the metadata what portion of
>>>>>>    the data should go to which node?
>>>>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves, my
>>>>>>    data is equally distributed across the two nodes, and replication
>>>>>>    is set to 2, then where and how will replication take place, as I
>>>>>>    do not have any vacant node for replication?
>>>>>>    5. Also, for the POC, does it make sense to go with the Cloudera
>>>>>>    3-node free cluster or the Hortonworks 3-node free cluster, or with
>>>>>>    the open-source Hadoop version? If we go with the open-source
>>>>>>    version, where do we define which node is the master and which is a
>>>>>>    slave, and can we have all 3 nodes on the same machine or do they
>>>>>>    need to be on different machines?
>>>>>>    6. Also, what are the pros and cons of going with
>>>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from an initial
>>>>>>    POC point of view?
>>>>>>    7. Also, if we go with Hortonworks/Cloudera, which tools come
>>>>>>    bundled with the Hadoop framework? And if we go with Apache Hadoop,
>>>>>>    do we get tools like Pig and Hive bundled, or do we have to install
>>>>>>    them separately?
>>>>>>
>>>>>> Since I am starting off on my Hadoop journey only recently, I would
>>>>>> really appreciate it if the community could point me in the right
>>>>>> direction.
>>>>>>
>>>>>> Regards, Andy.
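
To make D's answers 3 and 4 a bit more concrete, here is a toy sketch of what "replication 2 on two datanodes" works out to. It is not the HDFS API: `place_blocks` and the datanode names are made up for illustration, and the round-robin choice is an assumption (in a real cluster the namenode picks the targets). The only point is that every block ends up on both datanodes, so no "vacant" node is needed.

```python
def place_blocks(num_blocks, datanodes, replication):
    """Toy model: list the datanodes that hold a copy of each block.

    Hypothetical helper, not an HDFS API. Real placement is decided by
    the namenode; this simply rotates through the datanode list.
    """
    # A datanode holds at most one replica of a given block, so this toy
    # model cannot place more copies than there are datanodes.
    copies = min(replication, len(datanodes))
    placement = {}
    for block in range(num_blocks):
        # The first node in the list receives the block from the client
        # and forwards it to the next one, as in D's point 3.
        start = block % len(datanodes)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(copies)]
    return placement


if __name__ == "__main__":
    layout = place_blocks(num_blocks=4,
                          datanodes=["datanode1", "datanode2"],
                          replication=2)
    for block, pipeline in layout.items():
        print(f"block {block}: " + " -> ".join(pipeline))
```

With one master and two slaves, each line printed lists both datanodes, only in a different order: that is all replication 2 means in this setup.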

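
D's point 2 (you give MapReduce an input and define a map and a reduce function) can be sketched in plain Python as well. This runs in memory and does not use Hadoop at all; `map_fn`, `reduce_fn`, `run_job` and the "region,amount" record layout are assumptions chosen to resemble structured data, not anything from the thread.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (region, amount) pair for one record such as 'east,12.5'."""
    region, amount = line.split(",")
    yield region.strip(), float(amount)


def reduce_fn(key, values):
    """Combine all values that share a key; here, a simple sum."""
    return key, sum(values)


def run_job(lines):
    # Stand-in for the shuffle step: group the mapper output by key,
    # which the framework would do between the map and reduce phases.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    records = ["east,10.0", "west,4.5", "east,2.5"]
    print(run_job(records))
```

On a cluster the same two functions would run in parallel over the blocks stored on each datanode, which is where the benefit over a single-machine tool would come from.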