Sorry to jump in here, but does HBase use Map/Reduce under the covers? I was under the impression that HBase used the DFS of Hadoop but not Map/Reduce.
Thanks
-Pete

-----Original Message-----
From: Otis Gospodnetic [mailto:[email protected]]
Sent: Friday, February 25, 2011 2:39 PM
To: [email protected]
Subject: Re: Ad-hoc reports against HBase - any way? any tools?

Ah, I have another question:

When you have HBase-Hive integration in place, how do you control how queries that come in via Hive affect the HBase cluster?

Consider an HBase cluster whose primary task is to ingest data, process it with MR jobs, and store it back in some table(s). That's what the cluster does today. Now if we add HBase-Hive to the mix and people start writing HQL that runs MR jobs against data in HBase, this will affect the performance of those data ingestion jobs.

How do you deal with that? Are there ways to split the cluster so that HQL-triggered MR jobs run only on some set of nodes, while MR jobs that are part of the ingestion process run on a disjoint set of nodes, and yet have Hive's nodes see the new data that continuously gets ingested?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

----- Original Message ----
> From: Jean-Daniel Cryans <[email protected]>
> To: [email protected]
> Sent: Fri, February 25, 2011 5:04:31 PM
> Subject: Re: Ad-hoc reports against HBase - any way? any tools?
>
> HIVE-1634 will be a serious limitation if anything you store is in
> binary and you don't want to patch Hive. There are also some skew issues
> I have yet to investigate that may be due to the HBase integration...
> or not. Apart from that, our internal users are pretty happy.
>
> J-D
>
> On Fri, Feb 25, 2011 at 1:54 PM, Otis Gospodnetic
> <[email protected]> wrote:
> > Hi J-D,
> >
> > Yes, I'm interested in HBase-Hive integration.
> > Thanks for the pointer to the external tables. I was aware of that at some
> > point, but for some reason started thinking that data copying is necessary.
> >
> > Are there any gotchas or serious limitations around this integration?
> >
> > Thanks,
> > Otis
> >
> > ----- Original Message ----
> >> From: Jean-Daniel Cryans <[email protected]>
> >> To: [email protected]
> >> Sent: Fri, February 25, 2011 4:17:09 PM
> >> Subject: Re: Ad-hoc reports against HBase - any way? any tools?
> >>
> >> We use the HBase+Hive integration here for ad-hoc queries. I don't
> >> understand the data duplication you're talking about... when you
> >> create an external table you can directly query your existing tables.
> >> We run with the latest patch posted in HIVE-1634 since we have a lot
> >> of binary values, and I made a very, very hacky patch to be able to use
> >> our binary composite row keys.
> >>
> >> I'll be happy to give you more details if you want to try going down
> >> that road.
> >>
> >> J-D
> >>
> >> On Fri, Feb 25, 2011 at 1:02 PM, Otis Gospodnetic
> >> <[email protected]> wrote:
> >> > Hello,
> >> >
> >> > I have an HBase cluster chock-full of data and would like to run canned
> >> > reports (i.e., reports known ahead of time), but also ad-hoc reports
> >> > against that data.
> >> > Are there any open-source or commercial tools one can use?
> >> >
> >> > Here's what I *think* I know so far, but please correct me wherever I'm
> >> > wrong, so I don't spread false info:
> >> >
> >> > * Use HBase-Hive Integration
> >> >   Pluses:
> >> >   - lots of tools to query Hive are available
> >> >   Minuses:
> >> >   - data duplication
> >> >   - Hive's copy of data is always behind
> >> >   - I heard the integration is fairly alpha (e.g. you can't copy deltas
> >> >     to Hive, you have to copy all data every time you want to update
> >> >     your Hive store)
> >> >
> >> > * Use Pig
> >> >   https://issues.apache.org/jira/browse/PIG-970
> >> >   https://issues.apache.org/jira/browse/PIG-1205
> >> >   Pluses:
> >> >   - runs directly against HBase, no need to copy data
> >> >   Minuses:
> >> >   - PigLatin learning curve - in my case people wanting ad-hoc reports
> >> >     are not techies
> >> >   - No pretty front-end with syntax highlighting or visual querying or
> >> >     that accepts SQL and translates it to PigLatin
> >> >
> >> > * Use PigPen
> >> >   Pluses:
> >> >   - Visual == easy
> >> >   Minuses:
> >> >   - Looks abandoned judging by http://search-hadoop.com/m/Noacz1MECC7
> >> >     and https://issues.apache.org/jira/browse/PIG-366
> >> >
> >> > * Use Toad for Cloud
> >> >   Pluses:
> >> >   - accepts SQL, runs, and returns data
> >> >   - runs directly against HBase, no need to copy data
> >> >   Minuses:
> >> >   - some people reported it crashes
> >> >   - it allows the person querying the data to also modify the data,
> >> >     which is bad in my environment
> >> >
> >> > * Datameer DAS, Karmasphere Analyst, Pentaho, Beeswax -- they all seem
> >> >   to be able to get the data out of Hive, but not out of HBase. More
> >> >   info below:
> >> >
> >> > * Pentaho
> >> >   * http://www.pentaho.com/products/hadoop/ - looks like it supports
> >> >     only Hive
> >> >   * http://forums.pentaho.com/showthread.php?77926-HBase-and-ETL
> >> >   * http://search-hadoop.com/?q=pentaho&src=moz-search
> >> >
> >> > * Datameer
> >> >   * http://wiki.datameer.com/display/DAS1/DAS+Supported+Platforms -
> >> >     looks like it supports only Hive
> >> >   * http://wiki.datameer.com/display/DAS11/Using+the+Plug-in+SDK -
> >> >     looks like one can add support for HBase by writing a plugin?
> >> >
> >> > * Karmasphere Analyst
> >> >   * http://www.karmasphere.com/Products-Information/karmasphere-analyst.html
> >> >     - Hive only
> >> >
> >> > Is any of the above incorrect?
> >> > Did I miss a tool, free or non-free, that I could use to run ad-hoc
> >> > reports against data in HBase?
> >> >
> >> > Thanks,
> >> > Otis
> >> > ----
> >> > Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
> >> > Hadoop ecosystem search :: http://search-hadoop.com/
