We're still in the early stages of the architecture for the project I'm on, 
but at present we're investigating an HBase / Phoenix data store for its 
real-time query capabilities, and being able to expose data over a JDBC 
connector is attractive for us.

Much of our data is event-based, and many of the reports we'd like to do can be 
accomplished using simple SQL queries on that data - assuming they are 
performant. Thus far, the evidence shows that they are, even across many 
millions of rows.

However, we have a number of models that today exist as a combination of Pig 
and Python batch jobs, which I'd like to replace with Spark. Thus far, Spark 
has shown itself to be more than adequate for what we're doing.

As far as using Phoenix as an endpoint for a batch load goes, the only real 
advantage I see over using straight HBase is that I can specify a query to 
prefilter the data before attaching it to an RDD. I haven't run the numbers yet 
to see how this compares to more traditional methods, though.
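
For comparison, here's a rough sketch of what the equivalent "straight HBase" 
load would look like, with the filtering done in Spark rather than pushed down 
as a Phoenix query. The column family "cf" and the table/column names are just 
placeholders, not our actual schema:

--
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Plain HBase scan of the whole table via TableInputFormat
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "EVENTS")

val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Without Phoenix, the prefilter becomes an RDD-side filter instead
val filtered = hbaseRDD.filter { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("EVENTTYPE"))) == "some_type"
}
--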

The only worry I have is that the Phoenix input format doesn't adequately split 
the data across multiple nodes, so that's something I will need to look at 
further.
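
A rough sanity check I plan to run (using the phoenixRDD from the snippet in my 
earlier message quoted below) is simply to look at how many partitions the RDD 
ends up with, and how the rows are spread across them:

--
// Number of Spark partitions derived from the Phoenix input splits
println(phoenixRDD.partitions.length)

// Rough row count per partition, to see whether the data is actually spread out
phoenixRDD.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.size))
}.collect().foreach(println)
--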

Josh



> On Apr 25, 2014, at 6:33 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> 
> Josh, is there a specific use pattern you think is served well by Phoenix + 
> Spark? Just curious.
> 
> 
>> On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin <jmaho...@filetrek.com> wrote:
>> Phoenix generally presents itself as an endpoint using JDBC, which in my 
>> testing seems to play nicely with JdbcRDD. 
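>> 
>> For illustration, here's a minimal JdbcRDD sketch against the Phoenix JDBC 
>> driver. The connection string, the numeric BUCKET column used for the two '?' 
>> partition-bound placeholders, and the bounds themselves are just placeholders 
>> here, not anything from a real schema:
>> 
>> --
>> import java.sql.DriverManager
>> import org.apache.spark.rdd.JdbcRDD
>> 
>> // Register Phoenix's JDBC driver
>> Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
>> 
>> val jdbcRDD = new JdbcRDD(
>>   sc,
>>   () => DriverManager.getConnection("jdbc:phoenix:servername"),
>>   "SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE BUCKET >= ? AND BUCKET <= ?",
>>   0L, 99L,  // lower and upper bounds substituted into the '?' placeholders
>>   10,       // number of partitions
>>   rs => (rs.getString(1), rs.getTimestamp(2)))
>> --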
>> 
>> However, a few days ago a patch was made against Phoenix to implement 
>> support for Pig using a custom Hadoop InputFormat, which means it now has 
>> Spark support too.
>> 
>> Here's a code snippet that sets up an RDD for a specific query:
>> 
>> --
>> // Imports come from the phoenix-pig module (package names may vary by version)
>> import org.apache.hadoop.conf.Configuration
>> import org.apache.hadoop.io.NullWritable
>> import org.apache.phoenix.pig.PhoenixPigConfiguration
>> import org.apache.phoenix.pig.hadoop.{PhoenixInputFormat, PhoenixRecord}
>> 
>> // Push the filter down to Phoenix as a SELECT statement
>> val phoenixConf = new PhoenixPigConfiguration(new Configuration())
>> phoenixConf.setSelectStatement(
>>   "SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
>> phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
>> phoenixConf.configure("servername", "EVENTS", 100L)
>> 
>> // Wire the Phoenix InputFormat into Spark as a (key, value) RDD
>> val phoenixRDD = sc.newAPIHadoopRDD(
>>   phoenixConf.getConfiguration(),
>>   classOf[PhoenixInputFormat],
>>   classOf[NullWritable],
>>   classOf[PhoenixRecord])
>> --
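>> 
>> Even something trivial like phoenixRDD.count() should be enough to confirm 
>> the wiring; the RDD comes back as (NullWritable, PhoenixRecord) pairs.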
>> 
>> I'm still very new to Spark and even less experienced with Phoenix, but I'm 
>> hoping there's an advantage over the JdbcRDD in terms of partitioning. The 
>> JdbcRDD seems to implement partitioning based on a user-defined query 
>> predicate, but I think Phoenix's InputFormat is able to figure out the splits 
>> itself, which Spark can then leverage. I don't really know how to verify 
>> whether this is the case though, so if anyone else is looking into this, I'd 
>> love to hear their thoughts.
>> 
>> Josh
>> 
>> 
>>> On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas 
>>> <nicholas.cham...@gmail.com> wrote:
>>> Just took a quick look at the overview here and the quick start guide here.
>>> 
>>> It looks like Apache Phoenix aims to provide flexible SQL access to data, 
>>> for both transactional and analytic purposes, at interactive speeds.
>>> 
>>> Nick
>>> 
>>> 
>>>> On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>> First, I have not tried it myself. However, from what I have heard, it has 
>>>> some basic SQL features, so you can query your HBase tables much like you 
>>>> query content on HDFS using Hive. 
>>>> So it is not just "query a simple column"; I believe you can do joins and 
>>>> other SQL queries. Maybe you can spin up an EMR cluster with HBase 
>>>> preconfigured and give it a try. 
>>>> 
>>>> Sorry, I cannot provide a more detailed explanation or further help. 
>>>> 
>>>> 
>>>>> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier 
>>>>> <pomperma...@okkam.it> wrote:
>>>>> Thanks for the quick reply, Bin. Phoenix is something I'm going to try for 
>>>>> sure, but it seems somewhat pointless if I can already use Spark. 
>>>>> Probably, as you said, since Phoenix uses a dedicated data structure 
>>>>> within each HBase table, it makes more effective use of memory; but if I 
>>>>> need to deserialize data stored in an HBase cell, I still have to read 
>>>>> that object into memory, and thus I need Spark. From what I understand, 
>>>>> Phoenix is good if I have to query a simple column of HBase, but things 
>>>>> get really complicated if I have to add an index for each column in my 
>>>>> table and I store complex objects within the cells. Is that correct?
>>>>> 
>>>>> Best,
>>>>> Flavio
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>>>> Hi Flavio, 
>>>>>> 
>>>>>> I'm actually attending the 2014 Apache Conference right now, where I 
>>>>>> heard about a project called "Apache Phoenix", which fully leverages 
>>>>>> HBase and is supposed to be 1000x faster than Hive. And it is not 
>>>>>> memory-bound, which is a limiting factor for Spark. It is still in the 
>>>>>> Incubator, and the "stats" functions Spark has already implemented are 
>>>>>> still on its roadmap. I am not sure whether it will be good, but it might 
>>>>>> be something interesting to check out.
>>>>>> 
>>>>>> /usr/bin
>>>>>> 
>>>>>> 
>>>>>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
>>>>>>> <pomperma...@okkam.it> wrote:
>>>>>>> Hi everybody,
>>>>>>> lately I've been looking a bit at the recent evolution of the big data 
>>>>>>> stacks, and it seems that HBase is somehow fading away in favour of 
>>>>>>> Spark+HDFS. Am I correct? 
>>>>>>> Do you think that Spark and HBase should work together or not?
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Flavio
>>>>> 
> 
