Thank you so much, Serega.


On Sun, Sep 28, 2014 at 11:01 PM, Serega Sheypak <>

> I'm not sure how Pig HBaseStorage works. I suppose it would read all the
> data and then join it as a usual dataset. So you should expect serious HBase
> performance degradation during the read; you would get a key-by-key read of
> the whole table.
> 1. So do the join in Pig.
> 2. First you load the data from the HBase table, then operate on it. I don't
> see a case where you can use an HBase table directly in a join.
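>
> The load-then-join flow in point 2 might look roughly like the Pig sketch
> below. The paths, the table name 'location', the column family 'cf', and the
> field names are all hypothetical, and it assumes (as supposed above) that
> the table is read in full before the join rather than looked up per key.
>
> ```pig
> -- Hypothetical two-field weblog schema, for illustration only.
> weblog = LOAD '/data/weblog' USING PigStorage(',')
>     AS (locationId:chararray, url:chararray);
>
> -- Load the location table from HBase; '-loadKey true' returns the rowkey
> -- as the first field. Table and column names are assumptions.
> location = LOAD 'hbase://location'
>     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>         'cf:name cf:city', '-loadKey true')
>     AS (locationId:chararray, name:chararray, city:chararray);
>
> -- Ordinary Pig join on the shared key, then store the result to HDFS.
> joined = JOIN weblog BY locationId, location BY locationId;
> STORE joined INTO '/data/weblog_with_location';
> ```
>
> Here the HBase table is just another input relation to the join, which is
> why it may bring no benefit over keeping the location data in HDFS.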
> 2014-09-28 17:02 GMT+04:00 Krishna Kalyan <>:
>> We actually have 2 data sets in HDFS, location (3-5 GB, approx 10 columns
>> in each record) and weblog (2-3 TB, approx 50 columns in each record). We
>> need to join the data sets using the locationId, which is in both the
>> data-sets.
>> We have 2 options:
>> 1. Keep both data sets in HDFS only and JOIN them on locationId, maybe
>> using Pig.
>> 2. Since the JOIN will be on locationId, which is the primary key of the
>> location data set, store the location data set in HBase with locationId as
>> the rowkey and then use a Pig query to join the weblog data set with the
>> location table (using HBaseStorage).
>> The reason for considering this idea is that reading data by key is
>> faster in HBase. However, we are not sure whether, when joining the 2 data
>> sets, Pig will internally fetch each individual location record by key, or
>> read through all or some of the records in the location table and then do
>> the join. We can make the choice based on this.
>> We are free to use HDFS or HBase for any input or output data set, please
>> advise which option can provide us better performance. Also if required,
>> please point us to some good article on this.
>> On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <
>> > wrote:
>>> store location to hdfs
>>> store weblog to hdfs
>>> join them
>>> use HBase bulk load tool to load join result to hbase.
>>> What's the reason to keep location dataset in hbase and weblogs in hdfs?
>>> You can expect a data load performance improvement. For me it takes a few
>>> minutes to bulk load 500.000.000 records into a 10-node HBase cluster with
>>> a pre-split table.
>>> 2014-09-28 16:04 GMT+04:00 Krishna Kalyan <>:
>>>> Thanks Serega,
>>>> Our usecase details:
>>>> We have a location table which will be stored in HBase with locationID
>>>> as the rowkey / Joinkey.
>>>> We intend to join this table with a transactional WebLog file in HDFS
>>>> (Expected size can be around 2TB).
>>>> Joining query will be passed from Pig.
>>>> Can we expect a performance improvement compared with the MapReduce
>>>> approach?
>>>> Regards,
>>>> Krishna
>>>> On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <
>>>>> wrote:
>>>>> Depends on the dataset sizes and the HBase workload. The best way is to
>>>>> do the join in Pig, store it, and then use the HBase bulk load tool.
>>>>> It's a general recommendation. I have no idea about your task details.
>>>>> 2014-09-27 7:32 GMT+04:00 Krishna Kalyan <>:
>>>>> > Hi,
>>>>> > We have a use case that involves ETL on data coming from several
>>>>> > different sources using Pig.
>>>>> > We plan to store the final output table in HBase.
>>>>> > What will be the performance impact if we do a join with an external
>>>>> > CSV table using Pig?
>>>>> >
>>>>> > Regards,
>>>>> > Krishna
>>>>> >
