Hi,

There was a mention of Elasticsearch here that caught my attention. We use both HBase and Elasticsearch at Sematext. SPM <http://sematext.com/spm/>, which monitors things like Hadoop, Spark, etc., including HBase and ES, can actually use either HBase or Elasticsearch as its data store. We experimented with both, and a few-years-old version of HBase was more scalable than the latest ES, at least in our use case.
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <[email protected]> wrote:

> Dear Wilm and Ted,
>
> Thanks for your input and ideas.
>
> I will now step back and learn more about big data and big storage to
> be able to talk further.
>
> Cheers, Aleks
>
> On 28.11.2014 at 01:20, Wilm Schumacher wrote:
>
>> On 28.11.2014 at 00:32, Aleks Laz wrote:
>>
>>> What's the plan for the "MOB-extension"?
>>
>> https://issues.apache.org/jira/browse/HBASE-11339
>>
>>> From a development point of view I can build HBase with the "MOB-extension",
>>> but from a sysadmin point of view a 'package' (jar, zip, deb, rpm, ...) is
>>> much easier to maintain.
>>
>> That's true :/
>>
>>> We need to do some "access log" analysis, like Piwik or AWFFull.
>>
>> I see. Well, this is of course possible, too.
>>
>>> Maybe Elasticsearch is a better tool for that?
>>
>> I used Elasticsearch for full-text search. It works veeery well :D. Loved
>> it. But I never used it as a primary database, and I don't see an
>> advantage in using ES here.
>>
>>> As far as I have understood, Hadoop clients see a 'filesystem' with 37 TB
>>> or 120 TB, but from the server point of view, how should I plan the
>>> storage/server setup for the datanodes?
>>
>> Now I get your question. If you have a replication factor of 3 (so every
>> piece of data is held three times by the cluster), then the aggregated
>> storage has to be at least 3 times the 120 TB (+ buffer + operating
>> system etc.), i.e. at least 360 TB. So you could use 360 1 TB nodes, or
>> 3 120 TB nodes.
>>
>>> What happens when a datanode has 20 TB but the whole Hadoop/HBase 2-node
>>> cluster has 40?
>>
>> Well, if it is in a cluster of enough 20 TB nodes, nothing. HBase
>> distributes the data over the nodes.
>>
>>> ?! Why "40 million rows"? Do you mean the file tables?
>>> In the DB there is only some data like user accounts, the id for a
>>> directory and so on.
>>
>> If you use HBase as primary storage, every file would be a row. Think of
>> a "blob" in an RDBMS. 40 million files => 40 million rows.
>>
>> Assume you create an access log for the 40 million files, and assume
>> every file is accessed 100 times and every access is a row in another
>> "access log" table => 4 billion rows ;).
>>
>>> Currently, yes, PHP is the main language.
>>> I don't know a good solution for PHP similar to Hadoop; does anyone else
>>> know one?
>>
>> Well, the basic stuff could be done via Thrift/REST with a native PHP
>> binding. It depends on what you are trying to do. If it's just CRUD and
>> some scanning and filtering, Thrift/REST should be enough. But as you
>> said ... who knows what the future brings. If you want to do the fancy
>> stuff, you should use Java and deliver the data to your PHP application.
>>
>> Just for completeness: there is HiveQL, too. This is a kind of "SQL for
>> Hadoop". There is a Hive client for PHP (as it is delivered by Thrift):
>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>
>> Another fitting option for your access log could be Cassandra. Cassandra
>> is good at write performance, thus it is often used for logging. Cassandra
>> has an "SQL-like" language called CQL. This works from PHP almost like a
>> normal RDBMS: prepared statements and all this stuff.
>>
>> But I think this is the wrong way around. You should select a technology
>> first and then choose the language/interfaces etc.
>> And if you choose HBase, Java is a good choice; and if you use nginx, PHP
>> is a good choice. The only task then is to deliver data from A to B and
>> back.
>>
>> Best wishes,
>>
>> Wilm
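
To make the "every file is a row" / access-log-table idea above a bit more concrete, here is a minimal sketch using the HBase Java client (0.98-era API). The table names, column families and row-key scheme are invented for illustration, not taken from the actual setup discussed above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AccessLogSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // "files" table: one row per stored file, the blob kept in column family "f"
            HTable files = new HTable(conf, "files");
            Put file = new Put(Bytes.toBytes("file-0001"));
            file.add(Bytes.toBytes("f"), Bytes.toBytes("content"), Bytes.toBytes("<file bytes>"));
            files.put(file);
            files.close();

            // "access_log" table: one row per access,
            // row key = file id + reversed timestamp
            HTable accessLog = new HTable(conf, "access_log");
            long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
            Put access = new Put(Bytes.toBytes("file-0001#" + reversedTs));
            access.add(Bytes.toBytes("a"), Bytes.toBytes("user"), Bytes.toBytes("user-42"));
            accessLog.put(access);
            accessLog.close();
        }
    }

Putting the file id first in the access-log key keeps all accesses of one file next to each other, and the reversed timestamp makes the newest access sort first in a scan.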

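And for the Cassandra/CQL option Wilm mentions, an equally rough sketch, assuming the DataStax Java driver and an invented keyspace/table (from PHP the same CQL and prepared-statement pattern applies):

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    import java.util.Date;

    public class CqlAccessLogSketch {
        public static void main(String[] args) {
            // Assumed schema in keyspace "logs":
            //   CREATE TABLE access_log (file_id text, accessed_at timestamp,
            //                            user_id text,
            //                            PRIMARY KEY (file_id, accessed_at));
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("logs");

            // Prepare once, bind and execute per access -- just like a prepared
            // statement in a normal RDBMS
            PreparedStatement insert = session.prepare(
                    "INSERT INTO access_log (file_id, accessed_at, user_id) VALUES (?, ?, ?)");
            BoundStatement bound = insert.bind("file-0001", new Date(), "user-42");
            session.execute(bound);

            cluster.close();
        }
    }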