Re: Use Cases for Structured Data

[email protected] Thu, 13 Mar 2014 06:52:54 -0700

Okay, thank you D. I will start playing around with the Sandbox version.




On Thu, Mar 13, 2014 at 5:55 AM, Dieter De Witte <[email protected]> wrote:

> The Sandbox is just meant to be a learning environment, I guess, to see what's
> possible and how things can be connected. A real distribution will have much
> higher performance and is the one you need when you want to investigate
> performance issues. The only real drawback of the real distributions is
> that they take more time to get started with when you sometimes just want to
> play around.
>
>
> 2014-03-12 21:23 GMT+01:00 [email protected] <[email protected]>:
>
>> Hey D,
>> Regarding your point 5: "For a proof of concept I would use a ready-made
>> virtual machine from one of the three big vendors: Cloudera, MapR or Hortonworks"
>>
>> I want to understand how this virtual setup would work, how many
>> master and slave nodes I can have in it, and in general
>> what the differences are between an actual Hadoop distribution and these
>> ready-made virtual setups.
>>
>> Regards, Andy.
>>
>>
>>
>> On Wed, Mar 12, 2014 at 4:02 PM, Dieter De Witte <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> 1) HDFS is just a file system; it hides the fact that it is distributed.
>>> 2) MapReduce is the most low-level analytics tool, I think: you just
>>> specify an input, and in your map and reduce functions you define the
>>> logic to process that input. There is no need for HBase etc., although they
>>> can be extremely useful.
>>> 3) This is all in the Hadoop reference: first the namenode finds a place
>>> to allocate your data, then the data gets copied to the corresponding
>>> datanode 1, and from datanode 1 it is copied to datanode 2 (note that the
>>> numbers have no special meaning).
>>> 4) Your data will be on both datanodes. Why would that be a problem?
>>> 5) For a proof of concept I would use a ready-made virtual machine from
>>> one of the three big vendors: Cloudera, MapR or Hortonworks.
>>> 6) The Apache version is more basic; the commercial distributions have more
>>> built-in features and are easier to work with, I guess.
>>> 7) You have to install them separately, which is maybe the main reason to
>>> go for one of the vendors.
>>>
>>> You should definitely have a look at the reference. You don't have to
>>> read it from A to Z, but it contains sections where every single sentence
>>> will answer one of your questions.
>>>
>>> Regards, D
>>>
>>>
>>>
>>> 2014-03-12 20:37 GMT+01:00 [email protected] <[email protected]>:
>>>
>>>> Thank you Shahab, but it would be really nice if I could get some input on
>>>> my initial question, as it would really help.
>>>>
>>>>
>>>> On Wed, Mar 12, 2014 at 3:11 PM, Shahab Yunus <[email protected]> wrote:
>>>>
>>>>> I would suggest that, given the level of detail you are looking
>>>>> for and the fundamental nature of your questions, you get hold of books
>>>>> or online documentation. Basically, some reading/research.
>>>>>
>>>>> The latest edition of
>>>>> http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520 is
>>>>> highly recommended to begin with.
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Wed, Mar 12, 2014 at 3:07 PM, [email protected] <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello Team,
>>>>>>
>>>>>> I am starting off with the Hadoop ecosystem and first want to learn,
>>>>>> based on my use case, whether Hadoop is the right tool for me.
>>>>>>
>>>>>> I have only structured data, and my goal is to save this data into
>>>>>> Hadoop and benefit from its replication factor. I am using Microsoft tools
>>>>>> for analysis; they provide good drag-and-drop functionality for creating
>>>>>> different kinds of analysis, and they also have Hadoop drivers, so
>>>>>> they can use Hadoop as a data source for analysis.
>>>>>>
>>>>>> My question here is what benefits the YARN architecture gives me in terms
>>>>>> of analysis that my Microsoft, Netezza or Tableau products are not giving me.
>>>>>> I am just trying to understand the value of introducing Hadoop into my
>>>>>> architecture in terms of analysis, apart from data replication. Any insights
>>>>>> would be very helpful.
>>>>>>
>>>>>> Also, my goal for the POC is related to efficient data storage/retrieval,
>>>>>> and so:
>>>>>>
>>>>>>    1. How does data retrieval work in Hadoop?
>>>>>>    2. Do I always need some kind of data source on top of
>>>>>>    HDFS, like HBase/Cassandra/Mongo, or is there no need for one, so that I
>>>>>>    can store all my data in HDFS directly and retrieve it when I need it
>>>>>>    using different analytic tools that have HDFS as a data source?
>>>>>>    3. Say I have a 3-node cluster, one master and 2 slaves, and I
>>>>>>    am trying to insert data into Hadoop: what is the cycle that the framework
>>>>>>    performs to store my data in HDFS? Does my process read all the metadata
>>>>>>    from the master node about where my slave nodes are and what kind
>>>>>>    of data should go to which slave node, or is all data sent to the master
>>>>>>    node, which then decides, depending on the metadata,
>>>>>>    what portion of the data should go to which node?
>>>>>>    4. Also, if I have a 3-node cluster with 1 master and 2 slaves,
>>>>>>    my data is equally distributed over the two nodes, and I have replication
>>>>>>    set to 2, then where and how will replication take place, as I do not have
>>>>>>    any node vacant for replication?
>>>>>>    5. Also, for the POC, does it make sense to go with the Cloudera 3-node
>>>>>>    free cluster or the Hortonworks 3-node free cluster, or does it make sense
>>>>>>    to go with the open-source Hadoop version? If we go with the open-source
>>>>>>    Hadoop version, where do we define which node is the master and which is
>>>>>>    a slave, and can we have all 3 nodes on the same machine, or do we need
>>>>>>    to have all 3 nodes on different machines?
>>>>>>    6. Also, what are the pros and cons of going with
>>>>>>    Hortonworks/Cloudera as opposed to Apache Hadoop from an initial POC
>>>>>>    point of view?
>>>>>>    7. Also, if we go with Hortonworks/Cloudera, which tools
>>>>>>    come bundled with the Hadoop framework? And if we go with Apache
>>>>>>    Hadoop, do we get tools like Pig and Hive bundled, or do we have to
>>>>>>    install them separately?
>>>>>>
>>>>>> Since I am just starting off on my Hadoop journey, I would really
>>>>>> appreciate it if the community could point me in the right direction.
>>>>>>
>>>>>> Regards, Andy.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
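
As an illustration of point 2 in Dieter's reply (specify an input, then define map and reduce functions that process it), here is a minimal sketch in plain Python. It does not use the Hadoop API, it only mimics the map, shuffle and reduce phases in a single process. The sales records and the names map_record, reduce_values and run_job are made up for this example.

```python
from collections import defaultdict


def map_record(record):
    """Map phase: turn one structured input record into (key, value) pairs."""
    region, amount = record
    yield region, amount


def reduce_values(key, values):
    """Reduce phase: combine all values that share the same key."""
    return key, sum(values)


def run_job(records):
    """Single-process stand-in for a MapReduce job (assumption: no Hadoop involved)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            # Shuffle phase: group the mapped values by key.
            groups[key].append(value)
    return dict(reduce_values(key, values) for key, values in sorted(groups.items()))


# Hypothetical structured input: (region, sale amount) rows.
sales = [("east", 10), ("west", 5), ("east", 7)]
totals = run_job(sales)
print(totals)
```

On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data; the structure of the job stays the same.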

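Points 3 and 4 in Dieter's reply (the namenode picks a location, the data goes to one datanode and is then copied on to the next) can be pictured with a toy simulation, again in plain Python. This is only a sketch under loud assumptions: the round-robin choice of datanodes is invented for illustration and is not the placement policy real HDFS uses, and place_blocks and write_file are hypothetical helper names.

```python
def place_blocks(num_blocks, datanodes, replication):
    """Toy namenode: pick a pipeline of distinct datanodes for every block."""
    # Toy simplification: never ask for more replicas than there are datanodes.
    copies = min(replication, len(datanodes))
    placement = {}
    for block_id in range(num_blocks):
        # Rotate the starting node so that first replicas are spread out.
        start = block_id % len(datanodes)
        placement[block_id] = [
            datanodes[(start + i) % len(datanodes)] for i in range(copies)
        ]
    return placement


def write_file(num_blocks, datanodes, replication):
    """Toy write path: each block travels down its pipeline, node by node."""
    storage = {node: [] for node in datanodes}
    for block_id, pipeline in place_blocks(num_blocks, datanodes, replication).items():
        for node in pipeline:
            # The first node receives the block, later nodes receive a copy.
            storage[node].append(block_id)
    return storage


# One master (namenode) plus two slaves (datanodes), replication factor 2.
storage = write_file(4, ["datanode1", "datanode2"], 2)
print(storage)
```

With two datanodes and replication set to 2, every block ends up on both datanodes, which is why no vacant node is needed for replication.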