Hi Rita,

Re your suggestions: I do not own this new cluster at PNNL. It costs $4 million or so, and a lot of different lab groups will be using it. I want to try it because I can get a fair amount of time, nodes, and Lustre disk space at a nominal cost as an internal researcher in this division, as part of the institutional base capability. It does already have a 1 TB hard drive on each of the 480-odd nodes - but the policy (which I cannot change) is that the local disk space gets reclaimed for other jobs when your job is done. The only permanent storage (with yearly renewal) is on the Lustre file system, which is on RAID6.
I am not that concerned about losing data. This is not a production install where I would have users screaming at me to retrieve their lost data. For now, this is a research project with a relatively small number of users, and I know what data is going in, and where the data comes from. Source files will stay available. HBase files on Lustre can also be backed up to another file archive cluster in a separate building - which itself gets backed up to tape. If a node goes down and I lose some data, I'll know the tables affected because I'll know what was running. I've worked with relational databases (and logic-based databases, and object-oriented DBs, and, gnu help me, XML-based DBs) for many years - and I've gone through rebuilds. I already have a script that recreates almost all of the tables from scratch, importing the data sets. (Takes a heck of a long time to run, of course.)

So - this is not a 24/7 transactional DB. It is an early-stage warehouse for researchers that will have periodic updates. I am more concerned about getting a nice, long-term environment for properly designing and trying out a data warehouse for large-scale biological data integration than about max speed, or minor, recoverable data loss due to a node drop. (Though that would change, of course, if a larger number of users comes to pass.)

I just read that David Haussler at UC Santa Cruz is creating a five-petabyte repository of 20,000 genomes for cancer research. (Note to self: send him an email and find out what he's using - Hadoop? HBase or some other NoSQL db?) Think along those lines for what the PNNL data warehouse would be for - except we are more interested in next-gen sequencing RNA-seq transcriptomics data and mass spec proteomics data (combined with annotated genomes, of course, and other background public data on the organisms, proteins, etc.).
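For what it's worth, the periodic backup to the archive cluster could be as simple as copying the HBase root directory on Lustre into a dated snapshot directory. A minimal sketch - the paths below are made up for illustration, and it assumes the tables are static (or HBase is quiesced) during the copy:

```shell
#!/usr/bin/env bash
# Hypothetical backup sketch: copy the HBase root dir on Lustre to a
# dated snapshot directory on the archive filesystem. Both paths are
# illustrative defaults, not the real cluster paths.
set -euo pipefail

HBASE_ROOT="${HBASE_ROOT:-/tmp/hbase_backup_demo/hbase}"
ARCHIVE="${ARCHIVE:-/tmp/hbase_backup_demo/archive}"
STAMP="$(date +%Y%m%d)"

# Demo setup so the sketch is self-contained; on the real cluster
# HBASE_ROOT would already hold the table files.
mkdir -p "$HBASE_ROOT" "$ARCHIVE"
echo "region-data" > "$HBASE_ROOT/example_table_file"

# cp -a preserves the directory tree and timestamps; on the real
# cluster rsync -a would be the natural tool, since it skips files
# that have not changed between runs.
mkdir -p "$ARCHIVE/hbase-$STAMP"
cp -a "$HBASE_ROOT/." "$ARCHIVE/hbase-$STAMP/"

echo "backed up to $ARCHIVE/hbase-$STAMP"
```

A truly consistent backup of a live HBase instance would need more care (flushing or disabling tables first), but for periodic copies of a mostly static warehouse this kind of snapshot is probably enough.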
So I want a nice environment in which I can do analytics, and where I can look into laying the foundation for possible automated reasoning capabilities (parallelized Java rule engines, anyone? Possible Mahout use?). And possible ontology use. I want to be able to run analysis programs that simply were not possible otherwise - and then publish the new biological knowledge, since I'm in research and want to keep my job. So I want something that uses HBase for scaling, and which I don't have to hugely alter later on with a shift to another platform. I want to choose the right tools (Hadoop, HBase, the Hadoop ecosystem - that's the plan) that will leverage my prior work best - and allow wide flexibility in revising tables as I learn what works and what does not in supporting our analyses. Programmer effort (to wit, mine) is the most important criterion here.

If I have to move from this shared cluster later to a dedicated (and vastly smaller) Hadoop cluster which I have to pay for, no real damage done - the HBase db layout and Hadoop programs stay the same. But if this does work - on this very large ~100 teraflop + Lustre cluster - then for heavy-duty Big Data analysis programs I might be able to put in requests to use a large fraction of the cores at times (approaching 15,000 cores next year). Nice flexibility, if this all works out. And PNNL institutional support will be maintaining the cluster - I would have no hardware worries, no worries about doing OS updates, handling emergencies, and so on. I and my data warehouse would glide along, undisturbed by those mundane details. Hopefully a big difference from running my current cluster, let me tell you.

Heard back from a couple of other folks, who suggested I could at least give HBase a try on the Lustre system. We will see. I'll let the list know whether this works - or if I fall flat on my face.
But I'm fairly optimistic right now - the cluster's support staff here are superb and, as I said, they already have Cloudera CDH3-U2 up and running, working on terabytes of data (in HDFS, temporarily placed on local disks), invoking such on subsets of nodes using custom scripts. Just need to add in HBase, pointing it at Lustre. Fingers crossed.

Ron

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568
email: [email protected]

From: Rita [mailto:[email protected]]
Sent: Tuesday, December 06, 2011 3:53 PM
To: [email protected]
Cc: [email protected]; Taylor, Ronald C; Carlson, Timothy S; Regimbal, Kevin; Wiley, Steven
Subject: Re: want to try HBase on a large cluster running Lustre - any advice?

How would you handle a node failure? Do you have shared storage which exports LUNs to the datanodes? The beauty of hbase+hdfs is you can afford nodes going down (depending on your replication policy). Lustre is a great high-performance scratch filesystem, but using it as a backend for hbase, I think, isn't a good idea. I suggest you buy a disk for each node and then run hdfs and hbase on top of it. That's what I do.

On Mon, Dec 5, 2011 at 5:04 PM, Taylor, Ronald C <[email protected]> wrote:

Hello Lars,

Thanks for your previous help. Got a new question for you. I now have the opportunity to try using Hadoop and HBase on a newly installed cluster here, at a nominal cost. A lot of compute power (480+ nodes, 16 cores per node going up to 32 by the end of FY12, 64 GB RAM per node, with a few fat nodes with 256 GB). One local drive of 1 TB per node, and a four-petabyte Lustre file system. Hadoop jobs are already running on this new cluster, on terabyte-size data sets. Here's the drawback: I cannot permanently store HBase tables on local disk. After a job finishes, the disks are reclaimed.
So - if I want to build a continuously available data warehouse (basically for analytics runs, not for real-time web access by a large community at present - just me and other internal bioinformatics folk here at PNNL), I need to put the HBase tables on the Lustre file system.

Now, all the nodes in this cluster have a very fast InfiniBand QDR network interconnect. I think it's something like 40 gigabits/sec, as compared to the 1 gigabit/sec that you might see in a run-of-the-mill Hadoop cluster. And I just read a couple of white papers that say that if the network interconnect is good enough, the loss of data locality when you use Lustre with Hadoop is not such a bad thing. That is, I Googled and found several papers on HDFS vs Lustre. The latest one I found (2011) is a white paper from a company called Xyratex. Here's a quote from it:

"The use of clustered file systems as a backend for Hadoop storage has been studied previously. The performance of distributed file systems such as Lustre, Ceph, PVFS, and GPFS with Hadoop has been compared to that of HDFS. Most of these investigations have shown that non-HDFS file systems perform more poorly than HDFS, although with various optimizations and tuning efforts, a clustered file system can reach parity with HDFS. However, a consistent limitation in the studies of HDFS and non-HDFS performance with Hadoop is that they used the network infrastructure to which Hadoop is limited, TCP/IP, typically over 1 GigE. In HPC environments, where much faster network interconnects are available, significantly better clustered file system performance with Hadoop is possible."

Anyway, I am not principally worried about speed or efficiency right now - this cluster is big enough that even if I do not use it most efficiently, I'll still be doing better than with my very small current cluster, which has very limited RAM and antique processors. My question is: will HBase work at all on Lustre?

That is, on pp. 52-54 of your O'Reilly HBase book, you say that "... you are not locked into HDFS because the FileSystem used by HBase has a pluggable architecture and can be used to replace HDFS with any other supported system. The possibilities are endless and waiting for the brave at heart." ... "You can select a different filesystem implementation by using a URI pattern, where the scheme (the part before the first ":", i.e., the colon) part of the URI identifies the driver to be used."

We use HDFS by setting the URI to

hdfs://<namenode>:<port>/<path>

And you say that to simply use the local file system on a desktop Linux box (which would not replicate data or maintain copies of the files - no fault tolerance), one uses

file:///<path>

So - can I simply change this one param, and point HBase to a location in the Lustre file system? That is, use

<property>
  <name>hbase.rootdir</name>
  <value>file:///pic/scratch/rtaylor/hbase</value>
</property>

where "/pic" points to the root of the Lustre system. Or use something similar?

I am told that all of the Lustre OSTs are backed by RAID6, so my HBase tables would be fairly safe from hardware failure. If you put a file into the Lustre file system, chances are very slim you are going to lose it from a hardware failure. Also, I can make copies periodically to our gigantic file storage cluster in a separate building. This does not need to be a production HBase system (at least, for now). This is more of a data warehouse / analytics / data integration environment for several bioinformatics scientists, a system that we can afford to have go down from time to time, in a research environment.

Note that when I use Hadoop by itself, or Hadoop with HBase tables as sources and sinks, only the HBase accesses would be from the Lustre file system.
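For concreteness, here is a minimal sketch of what the relevant hbase-site.xml entries might look like under this plan - assuming /pic is the Lustre mount point and that the stock local-filesystem driver behaves sanely on Lustre (untested assumptions on my part):

```xml
<!-- Hypothetical hbase-site.xml fragment: point the HBase root at a
     directory on the Lustre mount instead of HDFS. The path below is
     illustrative, not a confirmed working value. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- The file:// scheme selects the local/POSIX filesystem driver;
         since Lustre is mounted at /pic, this lands on Lustre. Note
         this driver does no replication - durability would come from
         the RAID6 behind the Lustre OSTs. -->
    <value>file:///pic/scratch/rtaylor/hbase</value>
  </property>
  <property>
    <!-- Still run region servers across nodes, even though the
         backing store is a shared filesystem. -->
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
```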
The Hadoop programs would still be able to use HDFS on local disks on the subset of nodes allotted to them on the cluster - as the Hadoop programs now running on this new cluster are doing. My problem is just that I don't want to have to rebuild the HBase tables every time I want to do something, since the local disk space is reclaimed for other possible users after a job finishes. But I can get permanent (well, yearly renewal) disk space on the Lustre system.

So - any advice, before I give this a try? Will changing this one HBase config parameter suffice to get me started? Or are there other things involved?

- Ron

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568
email: [email protected]

-----Original Message-----
From: Lars George [mailto:[email protected]]
Sent: Wednesday, November 30, 2011 6:34 AM
To: [email protected]
Subject: Re: getting HBase up after an unexpected power failure - need some advice

Hey,

Looks like you have a corrupted ZK. Try and stop ZK (after stopping HBase, of course) and restart it. If that also fails, then wipe the data dir ZK uses (check the config, for example the zoo.cfg for standalone ZK nodes). ZK is going to recreate the data files and it should be able to move forward.

Cheers,
Lars

--
--- Get your facts first, then you can distort them as you please. ---
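(For my own notes, Lars's recovery procedure boils down to something like the sketch below - assuming a standalone ZK with its dataDir set in zoo.cfg. The paths here are demo stand-ins, and the actual daemon stop/start commands are left as comments:)

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the ZK recovery steps: locate the ZK data dir
# from zoo.cfg and wipe it so ZK recreates its files on restart. A demo
# zoo.cfg under /tmp stands in for the real config; the real stop/start
# of HBase and ZK is elided as comments.
set -euo pipefail

ZK_CONF="${ZK_CONF:-/tmp/zk_demo/zoo.cfg}"
mkdir -p /tmp/zk_demo/data/version-2
printf 'tickTime=2000\ndataDir=/tmp/zk_demo/data\nclientPort=2181\n' > "$ZK_CONF"
echo corrupt > /tmp/zk_demo/data/version-2/log.1   # simulated bad state

# 1. Stop HBase first, then ZK (elided): stop-hbase.sh ; zkServer.sh stop
# 2. Read dataDir out of zoo.cfg and wipe its contents:
DATA_DIR="$(sed -n 's/^dataDir=//p' "$ZK_CONF")"
rm -rf "${DATA_DIR:?}"/*
# 3. Restart ZK, then HBase (elided): zkServer.sh start ; start-hbase.sh

echo "wiped $DATA_DIR"
```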
