Hi Wilm.

On 27-11-2014 23:41, Wilm Schumacher wrote:
Hi Aleks ;),

On 27.11.2014 at 22:27, Aleks Laz wrote:
Our application is an nginx/php-fpm/postgresql setup.
The target design is nginx + proxy features / php-fpm / $DB / $Storage.

.) Can I mix HDFS /HBase for binary data storage and data analyzing?

Yes, HBase is perfect for that. For storage it will work (with the
"MOB extension"), and with MapReduce you can do whatever data analysis
you want. I assume you do some image processing with the data?!
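
For example, from the client's point of view storing an image is just a
normal Put (a rough sketch against the HBase 1.0+ client API; the table
name "cams", the family "img" and the row key scheme are my own invention,
and the table must have been created with a MOB-enabled family on a
MOB-patched build):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CamStore {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table cams = conn.getTable(TableName.valueOf("cams"))) {
                byte[] image = java.nio.file.Files.readAllBytes(
                        java.nio.file.Paths.get("snapshot.jpg"));
                // Row key: camera id + timestamp, so per-camera scans stay cheap.
                Put put = new Put(Bytes.toBytes("cam42-20141127-2341"));
                put.addColumn(Bytes.toBytes("img"), Bytes.toBytes("data"), image);
                cams.put(put);
            }
        }
    }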

What's the plan for the "MOB extension"?

From a development point of view I can build HBase with the "MOB extension",
but from a sysadmin point of view a 'package' (jar, zip, deb, rpm, ...) is much
easier to maintain.

Currently there are no plans to analyse the images, but who knows what the
future brings.

We need to do some "accesslog" analysis like Piwik or AWFFull.
Maybe Elasticsearch is a better tool for that?

.) What is the preferred way to use HBase with PHP?

The native client lib is in Java. This is the best way to go. But if you
only need basic access from the PHP application, then Thrift or REST
would be a good choice.

http://wiki.apache.org/hadoop/Hbase/ThriftApi
http://wiki.apache.org/hadoop/Hbase/Stargate

Stargate is a cool name ;-)

There are language bindings for both.
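
For a quick feel of the REST interface: a single cell can be fetched with a
plain HTTP GET (a sketch; host, port, table and column are made up, and it
assumes the REST server is running on the cluster):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class StargateGet {
        public static void main(String[] args) throws Exception {
            // GET /<table>/<row>/<column>; with Accept: application/octet-stream
            // the REST server returns the raw cell value.
            URL url = new URL("http://hbase-rest:8080/cams/cam42-20141127-2341/img:data");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/octet-stream");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
            System.out.println("got " + out.size() + " bytes");
        }
    }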

.) How difficult is it to use HBase with PHP?
That depends on what you are trying to do. If you just do a little
fetching, updating, inserting etc., it's pretty easy. More complicated
stuff I would do in Java and expose via a custom API from a Java service.
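
To illustrate the "easy" part, a get/put round trip with the native Java
client looks roughly like this (a sketch; the "users" table and "d" family
are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserCrud {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table users = conn.getTable(TableName.valueOf("users"))) {
                // Insert/update: a Put simply writes a new cell version.
                Put put = new Put(Bytes.toBytes("user-1001"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("email"),
                              Bytes.toBytes("aleks@example.com"));
                users.put(put);

                // Fetch it back.
                Result r = users.get(new Get(Bytes.toBytes("user-1001")));
                String email = Bytes.toString(
                        r.getValue(Bytes.toBytes("d"), Bytes.toBytes("email")));
                System.out.println(email);
            }
        }
    }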

.) What's a good solution for the 37 TB or the upcoming ~120 TB to
distribute?
   [ ] N servers with one 37 TB mountpoint per server?
   [ ] N servers with x TB mountpoints per server?
   [ ] other:
that's "not your business". hbase/hadoop does the trick for you. hbase
distributes the data, replicates it etc.. You will only talk to the master.
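
One thing to keep in mind, though: HDFS replicates every block (three times
by default), so raw disk across the cluster is not usable capacity. Roughly:

    usable ≈ raw / dfs.replication, e.g. 40 TB raw / 3 ≈ 13 TB usable

plus some headroom for MapReduce temp space and compactions.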

Well, but at the end of the day I will need physical storage distributed over
x servers.

My question is: do I need to make sure that the servers have enough storage
for the whole data set?

As far as I have understood, the Hadoop client sees a 'filesystem' with 37 TB
or 120 TB, but from the server point of view, how should I plan the
storage/server setup for the datanodes?

Judging from the hadoophbase-capacity-planning link below and

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/

#####
....
Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:

12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
...
#####

What happens when a datanode has 20 TB but the whole Hadoop/HBase 2-node
cluster has 40?

I see I'm still new to the Hadoop/HBase concepts.

.) Is HBase a good value for $Storage?
yes ;)

.) Is HBase a good value for $DB?
The DB size is smaller than 1 GB; I would use HBase just for the HA features
    of Hadoop.
Well, the official documentation says:
»First, make sure you have enough data. If you have hundreds of millions
or billions of rows, then HBase is a good candidate. If you only have a
few thousand/million rows, then using a traditional RDBMS might be a
better choice ...«

Okay, so for this I will stay on PostgreSQL with pgbouncer.

In my experience, at around 1-10 million rows RDBMSs are not really
usable anymore. But I have only used small/cheap hardware ... and don't like
RDBMSs ;).

;-)

Well, you will have at least 40 million rows ... and the platform is
growing. I think SQL isn't a choice anymore. And as you have heavy reads
and only a few writes, HBase is a good fit.

?! Why "40 million rows"? Do you mean the file tables?
The DB only holds some data like user accounts, the id for a directory and so on.

.) Due to the fact that HBase is a file-system I could use
      /cams , for binary data
      /DB   , for DB storage
      /logs , for log storage
    but is this wise? On the 'disk' they are different RAIDs.
HBase is a data store, not a filesystem. This was probably copy-pasted from
the original Hadoop question ;).
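
So instead of directories you would model those as tables (or column
families). A sketch with the Java admin API (HBase 1.0+; the table and
family names are my own, and the MOB calls assume the patched build):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateTables {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // "cams": one family for the binary blobs.
                HTableDescriptor cams =
                        new HTableDescriptor(TableName.valueOf("cams"));
                HColumnDescriptor img = new HColumnDescriptor("img");
                img.setMobEnabled(true);          // assumes the MOB-patched build
                img.setMobThreshold(100 * 1024);  // cells > 100 KB go to MOB files
                cams.addFamily(img);
                admin.createTable(cams);

                // "logs": a plain family for access-log rows.
                HTableDescriptor logs =
                        new HTableDescriptor(TableName.valueOf("logs"));
                logs.addFamily(new HColumnDescriptor("l"));
                admin.createTable(logs);
            }
        }
    }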

;-)

.) Should I plan a dedicated network+card for the 'cluster
   communication', as for most other cluster software?
From what I have read it looks unnecessary, but from a security point
   of view, yes.

http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/

Cloudera employees say that it wouldn't hurt if you have to push a lot
of data to the cluster.

Okay, so it is like other cluster setups.

.) Maybe the communication with the components (hadoop, zk, ...) could
   be set up with TLS?

HBase is built on top of Hadoop/HDFS, so this is in the "Hadoop domain".
Hadoop can encrypt the transported data with TLS, can encrypt the data on
disk, and you can use Kerberos auth (but this stuff I have never done), etc.
So the answer is yes.
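
For reference, those knobs live in the Hadoop/HBase configuration files;
from client code it looks roughly like this (the property names are the
stock Hadoop/HBase security settings, the values are just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class SecureConf {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Kerberos authentication for Hadoop and HBase.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hbase.security.authentication", "kerberos");
            // "privacy" = auth + integrity + wire encryption of HBase RPC.
            conf.set("hbase.rpc.protection", "privacy");
            // Encrypt HDFS block transfers between clients and datanodes.
            conf.set("dfs.encrypt.data.transfer", "true");
        }
    }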

Thanks.

Last remark: you seem kind of bound to PHP. The Hadoop world is written
in Java. Of course there are a lot of ways to do stuff in other
languages, via interfaces etc. But the Java API is the most powerful,
and sometimes there is no other way than to use it directly.

Currently, yes, PHP is the main language.
I don't know of a good solution for PHP similar to Hadoop; does anyone else know one?

I will take a look at

https://wiki.apache.org/hadoop/PoweredBy

to get some ideas for a working solution.

Best wishes,

Wilm

Thanks for your feedback.
I will dig deeper into this topic and start to set up the components step by step.

BR Aleks
