Hmmm, I don't recommend HBase in situations where you are not running an M/R framework. Sorry, as much as I love HBase, IMHO there are probably better solutions for a standalone NoSQL database. (YMMV depending on your use case.) The strength of HBase is that it's part of the Hadoop ecosystem.
I would think it would probably be better to go virtual than to run multiple region servers on bare hardware. You take a hit on I/O, but you can work around that too. But I'm conservative unless I have to get creative. ;-) Something to consider when whiteboarding ideas, though...

On Apr 30, 2013, at 1:30 PM, Andrew Purtell <[email protected]> wrote:

> You wouldn't do that if colocating MR. It is one way to soak up "extra"
> RAM on a large-RAM box, although I'm not sure I would recommend it (I
> have no personal experience trying it, yet). For more on this where
> people are actively considering it, see
> https://issues.apache.org/jira/browse/BIGTOP-732
>
> On Tue, Apr 30, 2013 at 11:14 AM, Michael Segel <[email protected]> wrote:
>
>> Multiple RS per host?
>> Huh?
>>
>> That seems very counterintuitive and potentially problematic with M/R
>> jobs. Could you expand on this?
>>
>> Thx
>>
>> -Mike
>>
>> On Apr 30, 2013, at 12:38 PM, Andrew Purtell <[email protected]> wrote:
>>
>>> Rules of thumb for starting off safely and for easing support issues
>>> are really good to have, but there are no hard barriers or singular
>>> approaches: use Java 7 + G1GC, disable the HBase blockcache in favor
>>> of the OS blockcache, run multiple regionservers per host. It is going
>>> to depend on how the cluster is used and loaded. If we are talking
>>> about coprocessors, then effective limits are less clear; using a
>>> coprocessor to integrate an external process implemented in native
>>> code, communicating over memory-mapped files in /dev/shm, isn't
>>> outside what is possible (strawman alert).
>>>
>>> On Tue, Apr 30, 2013 at 5:01 AM, Kevin O'Dell <[email protected]> wrote:
>>>
>>>> Asaf,
>>>>
>>>> The heap barrier is something of a legend :) You can ask 10 different
>>>> HBase committers what they think the max heap is and get 10 different
>>>> answers. This is my take on heap sizes from the many clusters I have
>>>> dealt with:
>>>>
>>>> 8GB -> Standard heap size, and tends to run fine without any tuning
>>>>
>>>> 12GB -> Needs some TLC with regard to JVM tuning if your workload
>>>> tends to cause churn (usually blockcache)
>>>>
>>>> 16GB -> GC tuning is a must, and now we need to start looking into
>>>> MSLAB and ZK timeouts
>>>>
>>>> 20GB -> Same as 16GB in regard to tuning, but we tend to need to
>>>> raise the ZK timeout a little higher
>>>>
>>>> 32GB -> We do have a couple of people running this high, but the pain
>>>> outweighs the gains (IMHO)
>>>>
>>>> 64GB -> Let me know how it goes :)
>>>>
>>>> On Tue, Apr 30, 2013 at 4:07 AM, Andrew Purtell <[email protected]> wrote:
>>>>
>>>>> I don't wish to be rude, but you are making odd claims as fact, as
>>>>> "mentioned in a couple of posts". It will be difficult to have a
>>>>> serious conversation. I encourage you to test your hypotheses and
>>>>> let us know if in fact there is a JVM "heap barrier" (and where it
>>>>> may be).
>>>>>
>>>>> On Monday, April 29, 2013, Asaf Mesika wrote:
>>>>>
>>>>>> I think for Phoenix truly to succeed, it needs HBase to break the
>>>>>> JVM heap barrier of 12G that I saw mentioned in a couple of posts.
>>>>>> Lots of analytics queries utilize memory, and since that memory is
>>>>>> shared with HBase, there is only so much you can do on a 12GB heap.
>>>>>> On the other hand, if Phoenix were implemented outside HBase on the
>>>>>> same machine (like Drill or Impala is doing), you could have 60GB
>>>>>> for this process, running many OLAP queries in parallel, utilizing
>>>>>> the same data set.
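For illustration, a minimal sketch of the /dev/shm strawman Andrew floats above: an external analytics process and the regionserver side exchanging bytes through a RAM-backed memory-mapped file, so the large analytics heap lives outside the regionserver JVM. The file path and layout here are hypothetical, not anything from the thread.

    // Hypothetical sketch: two processes sharing bytes via a memory-mapped
    // file under /dev/shm (tmpfs on Linux, so the mapping is RAM-backed).
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ShmSketch {
        public static void main(String[] args) throws Exception {
            // The path is made up; both processes would map the same file.
            try (RandomAccessFile f = new RandomAccessFile("/dev/shm/hbase-scratch", "rw");
                 FileChannel ch = f.getChannel()) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                buf.putLong(0, System.currentTimeMillis());  // writer side
                System.out.println("shared: " + buf.getLong(0));  // a reader sees the same bytes
            }
        }
    }

A real integration would of course need a framing protocol and synchronization between the two processes; this only shows that the shared-memory channel itself is a few lines of standard NIO.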
>>>>>> On Mon, Apr 29, 2013 at 9:08 PM, Andrew Purtell <[email protected]> wrote:
>>>>>>
>>>>>>>> HBase is not really intended for heavy data crunching
>>>>>>>
>>>>>>> Yes it is. This is why we have first-class MapReduce integration
>>>>>>> and optimized scanners.
>>>>>>>
>>>>>>> Recent versions, like 0.94, also do pretty well with the 'O' part
>>>>>>> of OLAP.
>>>>>>>
>>>>>>> Urban Airship's Datacube is an example of a successful OLAP
>>>>>>> project implemented on HBase:
>>>>>>> http://github.com/urbanairship/datacube
>>>>>>>
>>>>>>> "Urban Airship uses the datacube project to support its analytics
>>>>>>> stack for mobile apps. We handle about ~10K events per second per
>>>>>>> node."
>>>>>>>
>>>>>>> Also there is Adobe's SaasBase:
>>>>>>> http://www.slideshare.net/clehene/hbase-and-hadoop-at-adobe
>>>>>>>
>>>>>>> Etc.
>>>>>>>
>>>>>>> Where an HBase OLAP application will differ tremendously from a
>>>>>>> traditional data warehouse is of course in the interface to the
>>>>>>> datastore. You have to design and speak in the language of the
>>>>>>> HBase API, though Phoenix
>>>>>>> (https://github.com/forcedotcom/phoenix) is changing that.
>>>>>>>
>>>>>>> On Sun, Apr 28, 2013 at 10:21 PM, anil gupta <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Kiran,
>>>>>>>>
>>>>>>>> In HBase the data is denormalized, but at its core HBase is a
>>>>>>>> KeyValue-based database meant for lookups or queries that expect
>>>>>>>> a response in milliseconds. OLAP, i.e. data warehousing, usually
>>>>>>>> involves heavy data crunching, and HBase is not really intended
>>>>>>>> for heavy data crunching. If you just want to store denormalized
>>>>>>>> data and do simple queries, then HBase is good. For OLAP kinds of
>>>>>>>> workloads you can make HBase work, but IMO you will be better off
>>>>>>>> using Hive for data warehousing.
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>> Anil Gupta
>>>>>>>>
>>>>>>>> On Sun, Apr 28, 2013 at 8:39 PM, Kiran <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> But in HBase, data can be said to be in a denormalized state, as
>>>>>>>>> the methodology used for storage is a (column family:column)
>>>>>>>>> based flexible schema. Also, from Google's Bigtable paper it is
>>>>>>>>> evident that HBase is capable of doing OLAP. So where does the
>>>>>>>>> difference lie?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://apache-hbase.679495.n3.nabble.com/HBase-and-Datawarehouse-tp4043172p4043216.html
>>>>>>>>> Sent from the HBase User mailing list archive at Nabble.com.
>>>>
>>>> --
>>>> Kevin O'Dell
>>>> Systems Engineer, Cloudera
>
> --
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
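To make the "first-class MapReduce integration and optimized scanners" point concrete, here is a minimal sketch of a map-only row-count job over an HBase table. The table name and scanner settings are illustrative assumptions, not taken from the thread.

    // Hypothetical sketch: count rows of an HBase table with a map-only MR job.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class RowCountSketch {

        static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
            enum Counters { ROWS }  // a job counter is a cheap aggregate with no reduce phase

            @Override
            protected void map(ImmutableBytesWritable key, Result value, Context context)
                    throws IOException, InterruptedException {
                context.getCounter(Counters.ROWS).increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "row-count-sketch");
            job.setJarByClass(RowCountSketch.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // batch more rows per RPC for full scans
            scan.setCacheBlocks(false);  // don't churn the blockcache from MR scans

            TableMapReduceUtil.initTableMapperJob(
                    "mytable", scan, CountMapper.class,  // "mytable" is a placeholder
                    ImmutableBytesWritable.class, Result.class, job);
            job.setOutputFormatClass(NullOutputFormat.class);
            job.setNumReduceTasks(0);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Raising scanner caching and disabling blockcache for the scan is the usual pairing for MR over HBase: the job streams through the table once, so caching its blocks would only evict data the serving path actually needs.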

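Finally, on Andrew's point that you have to design and speak in the language of the HBase API, though Phoenix is changing that: a short sketch contrasting a raw client-API Get with the same lookup through Phoenix's JDBC driver. The table, column, and ZooKeeper host names are invented for the example.

    // Hypothetical sketch: the same point lookup via the HBase client API
    // and via Phoenix SQL over JDBC.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InterfaceSketch {
        public static void main(String[] args) throws Exception {
            // 1. Speaking the HBase API directly: byte[] keys, families, qualifiers.
            HTable table = new HTable(HBaseConfiguration.create(), "metrics");  // placeholder table
            Result r = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("val"))));
            table.close();

            // 2. The same lookup through Phoenix, where the interface is just SQL.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");  // placeholder quorum
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT val FROM metrics WHERE pk = 'row-42'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("val"));
                }
            }
        }
    }

The second half assumes the table was created through Phoenix so its metadata is visible to the SQL layer; the first half works against any HBase table.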