Hi Esteban,

Thanks for sharing the ideas.
We are on HBase 0.96 and Java 1.6. I have enabled short-circuit reads, and the heap size is around 16G for each region server. We have about 20 of them. The list of rowkeys that I need to process is about 10M. I am using batch gets already, and the batch size is ~2000 gets.

thomas

On Thu, Aug 14, 2014 at 11:01 AM, Esteban Gutierrez <[email protected]> wrote:

> Hello Thomas,
>
> What version of HBase are you using? Sorting and grouping based on the
> regions the rows belong to is going to help for sure. I don't think you
> should focus too much on the locality side of the problem unless your
> HDFS input set is too large (100s or 1000s of MBs per task); otherwise it
> might be faster to load the input dataset in memory and do the batched
> calls. As discussed on this mailing list recently, there are too many
> factors that might be involved in the performance: number of threads or
> tasks, size of the rows, RS resources, configuration, etc., so any
> additional info would be very helpful.
>
> cheers,
> esteban.
>
> --
> Cloudera, Inc.
>
>
> On Thu, Aug 14, 2014 at 10:32 AM, Thomas Kwan <[email protected]>
> wrote:
>
>> Hi there,
>>
>> I have a use-case where I need to do a read to check if an HBase entry
>> is present, then do a put to create the entry when it is not there.
>>
>> I have a script that gets a list of rowkeys from Hive and puts them in
>> an HDFS directory. Then I have an MR job that reads the rowkeys and
>> does batch reads. I am getting around 1.5K requests per second.
>>
>> To attempt to make this faster, I am wondering if I can:
>>
>> - sort and group the rowkeys based on regions
>> - make the MR tasks run on the region servers that host the data locally
>>
>> Scan or TableInputFormat must have some code to do something similar,
>> right?
>>
>> thanks
>> thomas
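
For reference, the batched read-then-put pattern described in the thread looks roughly like this against the HBase 0.96 client API (Java 6 syntax). This is a minimal sketch: the table handle is passed in, and the column family, qualifier, and value are placeholders, not details from the actual job.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchCheckThenPut {
        // Placeholder family/qualifier/value, not from the original job.
        private static final byte[] FAMILY = Bytes.toBytes("f");
        private static final byte[] QUALIFIER = Bytes.toBytes("q");

        public static void process(HTable table, List<byte[]> rowkeys)
                throws IOException {
            // One batched round trip for the existence checks
            // (~2000 rowkeys per call in the setup described above).
            List<Get> gets = new ArrayList<Get>(rowkeys.size());
            for (byte[] row : rowkeys) {
                gets.add(new Get(row));
            }
            Result[] results = table.get(gets);

            // Collect puts only for rows that came back empty.
            List<Put> puts = new ArrayList<Put>();
            for (int i = 0; i < results.length; i++) {
                if (results[i].isEmpty()) {
                    Put put = new Put(rowkeys.get(i));
                    put.add(FAMILY, QUALIFIER, Bytes.toBytes(1L)); // placeholder value
                    puts.add(put);
                }
            }
            if (!puts.isEmpty()) {
                table.put(puts);
            }
        }
    }

Note that the get-then-put pair is not atomic: another writer could create the row between the two calls. HTable.checkAndPut(row, family, qualifier, null, put) makes the check-and-create atomic per row (a null expected value means "only if absent"), but it costs one RPC per row, so it trades throughput for that guarantee.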

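And a minimal sketch of the sort-and-group-by-region step discussed in the thread, again against the 0.96 API, where HTable.getRegionLocation(byte[]) is the public per-row region lookup. The class name and grouping-by-encoded-region-name approach here are illustrative, not from the original job.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionGrouper {
        public static Map<String, List<byte[]>> groupByRegion(
                HTable table, List<byte[]> rowkeys) throws IOException {
            // Sorting first keeps each batch's keys contiguous,
            // which helps block-cache and disk locality on the RS.
            Collections.sort(rowkeys, Bytes.BYTES_COMPARATOR);

            Map<String, List<byte[]>> groups =
                new HashMap<String, List<byte[]>>();
            for (byte[] row : rowkeys) {
                HRegionLocation loc = table.getRegionLocation(row);
                String region = loc.getRegionInfo().getEncodedName();
                List<byte[]> group = groups.get(region);
                if (group == null) {
                    group = new ArrayList<byte[]>();
                    groups.put(region, group);
                }
                group.add(row);
            }
            return groups;
        }
    }

After the first META fetch, getRegionLocation is served from the client's region cache, so grouping even 10M keys this way is mostly CPU. Issuing one batch of Gets per group means each multi-get RPC lands on a single region server instead of fanning out across the cluster.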