Thank you so much Serega. Regards, Krishna
On Sun, Sep 28, 2014 at 11:01 PM, Serega Sheypak <serega.shey...@gmail.com> wrote:

> https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
> I'm not sure how Pig HBaseStorage works. I suppose it would read all the
> data and then join it as an ordinary dataset, so you should expect serious
> HBase performance degradation during the read: you would get a key-by-key
> read of the whole table.
> 1. So do the join in Pig.
> 2. First load the data from the HBase table, then operate on it. I don't
> see a case where you can use an HBase table directly in a join.
>
>
> 2014-09-28 17:02 GMT+04:00 Krishna Kalyan <krishnakaly...@gmail.com>:
>
>> We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
>> per record) and weblog (2-3 TB, approx 50 columns per record). We need to
>> join the data sets on locationId, which is present in both.
>>
>> We have 2 options:
>> 1. Keep both data sets in HDFS and JOIN them on locationId, maybe using
>> Pig.
>> 2. Since the JOIN is on locationId, which is the primary key of the
>> location data set, store the location data set in HBase with locationId
>> as the rowkey, and then use a Pig query (via HBaseStorage) to join the
>> weblog data set with the location table.
>>
>> The reason for considering the second option is that reading data by key
>> is faster in HBase. However, we are not sure whether, in a JOIN of the 2
>> data sets, Pig will internally fetch individual location records by key,
>> or read through the entire (or part of the) location table and then do
>> the join. Based on this we can make the choice.
>>
>> We are free to use HDFS or HBase for any input or output data set, so
>> please advise which option can give us better performance. Also, if
>> possible, please point us to a good article on this.
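
[The two options under discussion can be sketched in Pig Latin. This is a
rough illustration only; the paths, schemas, HBase table name, and column
families below are hypothetical placeholders, not anything from the thread.]

```pig
-- Option 1: both data sets on HDFS. The location side (3-5 GB) may be
-- small enough for a map-side join ('replicated') if it fits in mapper
-- memory; otherwise drop the hint and use the default reduce-side join.
weblog   = LOAD '/data/weblog'   USING PigStorage('\t')
           AS (locationId:chararray, url:chararray);   -- ~50 cols in reality
location = LOAD '/data/location' USING PigStorage('\t')
           AS (locationId:chararray, city:chararray);  -- ~10 cols in reality
joined   = JOIN weblog BY locationId, location BY locationId USING 'replicated';

-- Option 2: location in HBase. Note that HBaseStorage drives a table scan
-- feeding the join (-loadKey emits the rowkey as the first field); it does
-- NOT do a point lookup per weblog row, which is Serega's concern above.
location_hb = LOAD 'hbase://location'
              USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                  'info:city info:country', '-loadKey true')
              AS (locationId:chararray, city:chararray, country:chararray);
joined_hb   = JOIN weblog BY locationId, location_hb BY locationId;
```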
>>
>> On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>
>>> Store location to HDFS, store weblog to HDFS, join them, and use the
>>> HBase bulk load tool to load the join result into HBase.
>>>
>>> What's the reason to keep the location dataset in HBase and the weblogs
>>> in HDFS?
>>>
>>> You can expect a data-load performance improvement: for me it takes a
>>> few minutes to bulk load 500,000,000 records into a 10-node HBase
>>> cluster with a presplit table.
>>>
>>> 2014-09-28 16:04 GMT+04:00 Krishna Kalyan <krishnakaly...@gmail.com>:
>>>
>>>> Thanks Serega,
>>>>
>>>> Our use case details:
>>>> We have a location table that will be stored in HBase with locationID
>>>> as the rowkey / join key.
>>>> We intend to join this table with a transactional WebLog file in HDFS
>>>> (expected size around 2 TB).
>>>> The join query will be issued from Pig.
>>>> Can we expect a performance improvement compared with a plain
>>>> MapReduce approach?
>>>>
>>>> Regards,
>>>> Krishna
>>>>
>>>> On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>>
>>>>> It depends on the dataset sizes and the HBase workload. The best way
>>>>> is to do the join in Pig, store the result, and then use the HBase
>>>>> bulk load tool. That's a general recommendation; I have no idea about
>>>>> your task details.
>>>>>
>>>>> 2014-09-27 7:32 GMT+04:00 Krishna Kalyan <krishnakaly...@gmail.com>:
>>>>>
>>>>> > Hi,
>>>>> > We have a use case that involves ETL on data coming from several
>>>>> > different sources using Pig.
>>>>> > We plan to store the final output table in HBase.
>>>>> > What will be the performance impact if we do a join with an
>>>>> > external CSV table using Pig?
>>>>> >
>>>>> > Regards,
>>>>> > Krishna
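
[The "join in Pig, then bulk load" workflow Serega recommends might look
like the sketch below. All paths, schemas, and the target table/column
family names are hypothetical; the bulk-load commands use HBase's standard
ImportTsv and LoadIncrementalHFiles tools.]

```pig
-- Join both HDFS data sets on locationId and store the result as TSV.
weblog   = LOAD '/data/weblog'   USING PigStorage('\t')
           AS (locationId:chararray, url:chararray);
location = LOAD '/data/location' USING PigStorage('\t')
           AS (locationId:chararray, city:chararray);
joined   = JOIN weblog BY locationId, location BY locationId;
result   = FOREACH joined GENERATE weblog::locationId, url, city;
STORE result INTO '/tmp/joined' USING PigStorage('\t');

-- Then bulk load the TSV into a presplit HBase table, e.g.:
--   hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
--     -Dimporttsv.columns=HBASE_ROW_KEY,info:url,info:city \
--     -Dimporttsv.bulk.output=/tmp/hfiles  result_table  /tmp/joined
--   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
--     /tmp/hfiles  result_table
```

[Bulk load writes HFiles directly and moves them into the table, bypassing
the normal write path (WAL and memstore), which is why it is so much faster
than issuing individual puts, as Serega's 500M-records-in-minutes figure
suggests.]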