Hi,

There was a mention of Elasticsearch here that caught my attention. We use both HBase and Elasticsearch at Sematext. SPM <http://sematext.com/spm/>, which monitors things like Hadoop, Spark, etc., including HBase and ES, can actually use either HBase or Elasticsearch as its data store. We experimented with both, and a few-years-old version of HBase was more scalable than the latest ES, at least in our use case.
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Nov 27, 2014 at 7:32 PM, Aleks Laz <[email protected]> wrote:

> Dear Wilm and Ted,
>
> Thanks for your input and ideas.
>
> I will now step back and learn more about big data and big storage to
> be able to talk further.
>
> Cheers, Aleks
>
> On 28.11.2014 at 01:20, Wilm Schumacher wrote:
>
>> On 28.11.2014 at 00:32, Aleks Laz wrote:
>>
>>> What's the plan for the "MOB-extension"?
>>
>> https://issues.apache.org/jira/browse/HBASE-11339
>>
>>> From a development point of view I can build HBase with the "MOB-extension",
>>> but from a sysadmin point of view a 'package' (jar, zip, deb, rpm, ...) is
>>> much easier to maintain.
>>
>> That's true :/
>>
>>> We need to do some "access log" analysis, like Piwik or AWFFull.
>>
>> I see. Well, this is of course possible, too.
>>
>>> Maybe Elasticsearch is a better tool for that?
>>
>> I used Elasticsearch for full-text search. It works veeery well :D. Loved
>> it. But I never used it as a primary database, and I don't see an
>> advantage in using ES here.
>>
>>> As far as I have understood, Hadoop clients see a 'filesystem' with 37 TB
>>> or 120 TB, but from the server point of view, how should I plan the
>>> storage/server setup for the datanodes?
>>
>> Now I get your question. If you have a replication factor of 3 (so every
>> piece of data is held three times by the cluster), then the aggregated
>> storage has to be at least 3 times the 120 TB (+ buffer + operating
>> system etc.), i.e. at least 360 TB. So you could use 360 1 TB nodes, or
>> 3 120 TB nodes.
>>
>>> What happens when a datanode has 20 TB but the whole Hadoop/HBase 2-node
>>> cluster has 40?
>>
>> Well, if it is in a cluster of enough 20 TB nodes, nothing. HBase
>> distributes the data over the nodes.
>>
>>> ?! Why "40 million rows"? Do you mean the file tables?
>>> In the DB there is only some data like user accounts, the id for a
>>> directory and so on.
>>
>> If you use HBase as primary storage, every file would be a row. Think of
>> a "blob" in an RDBMS. 40 million files => 40 million rows.
>>
>> Assume you create an access log for the 40 million files, and assume
>> every file is accessed 100 times and every access is a row in another
>> "access log" table => 4 billion rows ;).
>>
>>> Currently, yes, PHP is the main language.
>>> I don't know a good solution for PHP similar to Hadoop; does anyone else
>>> know one?
>>
>> Well, the basic stuff could be done via Thrift/REST with a native PHP
>> binding. It depends on what you are trying to do. If it's just CRUD and
>> some scanning and filtering, Thrift/REST should be enough. But as you
>> said ... who knows what the future brings. If you want to do the fancy
>> stuff, you should use Java and deliver the data to your PHP application.
>>
>> Just for completeness: there is HiveQL, too. This is a kind of "SQL for
>> Hadoop". There is a Hive client for PHP (as it is delivered by Thrift):
>> https://cwiki.apache.org/confluence/display/Hive/HiveClient
>>
>> Another fitting option for your access log could be Cassandra. Cassandra
>> is good at write performance, thus it is often used for logging. Cassandra
>> has an "SQL-like" language called CQL. This works from PHP almost like a
>> normal RDBMS: prepared statements and all this stuff.
>>
>> But I think this is the wrong way around. You should select a technology
>> first and then choose the language/interfaces etc.
>> And if you choose HBase, Java is a good choice; and if you use nginx, PHP
>> is a good choice. The only task then is to deliver data from A to B and
>> back.
>>
>> Best wishes,
>>
>> Wilm
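
To make the "every file is a row" / access-log-table idea above a bit more concrete, here is a minimal sketch using the HBase Java client (0.98-era API). The table names, column families and row-key scheme are invented for illustration, not taken from the actual setup discussed above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AccessLogSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // "files" table: one row per stored file, the blob kept in column family "f"
            HTable files = new HTable(conf, "files");
            Put file = new Put(Bytes.toBytes("file-0001"));
            file.add(Bytes.toBytes("f"), Bytes.toBytes("content"), Bytes.toBytes("<file bytes>"));
            files.put(file);
            files.close();

            // "access_log" table: one row per access,
            // row key = file id + reversed timestamp
            HTable accessLog = new HTable(conf, "access_log");
            long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
            Put access = new Put(Bytes.toBytes("file-0001#" + reversedTs));
            access.add(Bytes.toBytes("a"), Bytes.toBytes("user"), Bytes.toBytes("user-42"));
            accessLog.put(access);
            accessLog.close();
        }
    }

Putting the file id first in the access-log key keeps all accesses of one file next to each other, and the reversed timestamp makes the newest access sort first in a scan.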

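And for the Cassandra/CQL option Wilm mentions, an equally rough sketch, assuming the DataStax Java driver and an invented keyspace/table (from PHP the same CQL and prepared-statement pattern applies):

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    import java.util.Date;

    public class CqlAccessLogSketch {
        public static void main(String[] args) {
            // Assumed schema in keyspace "logs":
            //   CREATE TABLE access_log (file_id text, accessed_at timestamp,
            //                            user_id text,
            //                            PRIMARY KEY (file_id, accessed_at));
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("logs");

            // Prepare once, bind and execute per access -- just like a prepared
            // statement in a normal RDBMS
            PreparedStatement insert = session.prepare(
                    "INSERT INTO access_log (file_id, accessed_at, user_id) VALUES (?, ?, ?)");
            BoundStatement bound = insert.bind("file-0001", new Date(), "user-42");
            session.execute(bound);

            cluster.close();
        }
    }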