Hello,

I see 0.94.5 has already been released, so I wondered how I can solve the issue that we have. In more detail: we have a table with billions of records. Most of the MapReduce jobs that we run select from this table the records whose family mk has a given value. For example:
  get 'mytable', 'row1', 'mk'
  COLUMN        CELL
  mk:_genmrk_   timestamp=1360869679003, value=1360869340-1376304115
  mk:_updmrk_   timestamp=1360869376272, value=1360869340-1376304115
  mk:dist

The map of a MapReduce job goes over all records and checks whether _genmrk_ is equal to the given value. So my question is: is it possible to select all records with mk:_genmrk_ = myvalue and feed them to the map of the MapReduce job, instead of iterating over all records?

Thanks in advance.
Alex.

-----Original Message-----
From: Ted Yu <[email protected]>
To: user <[email protected]>
Sent: Fri, Feb 8, 2013 6:23 pm
Subject: Re: split table data into two or more tables

See the following javadoc in Scan.java:

 * To only retrieve columns within a specific range of version timestamps,
 * execute {@link #setTimeRange(long, long) setTimeRange}.

You can search for the above method in unit tests.

In your use case, is family f the only family? If not, take a look at HBASE-5416, which is coming in 0.94.5; family f would be the essential column.

Cheers

On Fri, Feb 8, 2013 at 5:47 PM, <[email protected]> wrote:
> Hi,
>
> Thanks for the suggestions. How can a time-range scan be implemented in Java
> code? Is there any sample code or tutorial?
> Also, is it possible to select by the value of a column? Let's say I know that
> the records have family f and column m, and the new records have m=5. I need to
> instruct HBase to send only these records to the mapper of MapReduce jobs.
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Ted Yu <[email protected]>
> To: user <[email protected]>
> Sent: Fri, Feb 8, 2013 11:05 am
> Subject: Re: split table data into two or more tables
>
> bq. in a cluster of 2 nodes + 1 master
> I assume you're limited by hardware in that regard.
>
> bq. job selects these new records
> Have you used a time-range scan?
>
> Cheers
>
> On Fri, Feb 8, 2013 at 10:59 AM, <[email protected]> wrote:
> > Hi,
> >
> > The rationale is that I have a MapReduce job that constantly adds new
> > records to an HBase table.
> > The next MapReduce job selects these new records, but it must iterate over
> > all records and check whether each one is a candidate for selection.
> > Since there are too many old records, iterating through them in a cluster
> > of 2 nodes + 1 master takes about 2 days. So I thought splitting them into
> > two tables must reduce this time, and as soon as I figure out that there
> > are no more new records left in one of the new tables, I will not run the
> > MapReduce job on it.
> >
> > Currently, we have 7 regions, including ROOT and META.
> >
> > Thanks.
> > Alex.
> >
> > -----Original Message-----
> > From: Ted Yu <[email protected]>
> > To: user <[email protected]>
> > Sent: Fri, Feb 8, 2013 10:40 am
> > Subject: Re: split table data into two or more tables
> >
> > May I ask the rationale behind this?
> > Were you aiming for higher write throughput?
> >
> > Please also tell us how many regions you have in the current table.
> >
> > Thanks
> >
> > BTW, please consider upgrading to 0.94.4.
> >
> > On Fri, Feb 8, 2013 at 10:36 AM, <[email protected]> wrote:
> > > Hello,
> > >
> > > I wondered if there is a way of splitting data from one table into two
> > > or more tables in HBase with identical schemas, i.e. if table A has 100M
> > > records, put 50M into table B and 50M into table C, and delete table A.
> > > Currently, I use hbase-0.92.1 and hadoop-1.4.0.
> > >
> > > Thanks.
> > > Alex.
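[The value-based selection asked about at the top of the thread can be sketched with a SingleColumnValueFilter driving a table mapper. The mapper class and the literal "myvalue" are placeholders, not code from the thread. Note the caveat: the filter is still evaluated server-side against every row, so the region servers do a full scan, but only matching rows are shipped to the mappers.]

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class GenMarkJobSketch {

    // Placeholder mapper: receives only the rows that passed the filter.
    static class GenMarkMapper
            extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                           Context context)
                throws IOException, InterruptedException {
            context.write(row, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "select by mk:_genmrk_");
        job.setJarByClass(GenMarkJobSketch.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger batches per RPC for MR scans
        scan.setCacheBlocks(false);  // don't pollute the block cache

        // Keep only rows where mk:_genmrk_ equals the given value.
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("mk"), Bytes.toBytes("_genmrk_"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("myvalue"));
        // Also drop rows that have no mk:_genmrk_ cell at all.
        filter.setFilterIfMissing(true);
        scan.setFilter(filter);

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, GenMarkMapper.class,
                ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This avoids the per-mapper "check _genmrk_ myself" logic, but not the underlying scan cost; for that, the time-range approach or the essential-column-family optimization in HBASE-5416 (0.94.5) discussed above is the relevant lever.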
