This is a common but hard problem, and I don't have a great answer. The issue with doing a random read for each line you process is that there's no way to batch them, so you're basically doing this:
- Open a socket to a region server
- Send the request over the network
- The region server seeks in a bunch of files (how many exactly depends
  on your setup; also the files might not be local, and since you call
  setCacheBlocks(false) you're making sure you'll never cache blocks)
- The region server sends the answer back over the network
- etc.

However you put it, this cannot be fast; it can only be less slow. Also, since you are running a MR job, your random reads are competing with the long sequential reads coming from your mappers.

Instead of doing random reads you could try doing a join, which essentially comes down to a scan of your HDFS file plus a scan of your HBase table, plus whatever overhead of spilling to disk and whatnot. If the data you are reading from HBase is only a tiny portion of its actual content, doing a full scan might not be the right option. If it's really small, then put it in the distributed cache. If you reuse some of those rows, then try to cache them in your mappers.

Finally, kind of like Paul said, if you can emit your rows and somehow batch them reducer-side in order to do either short scans or a multi-get (see HTable.get(List<Get>)), it could be faster.

Hope this helps a tiny bit,

J-D

On Tue, Jun 19, 2012 at 1:37 AM, Marcin Cylke <[email protected]> wrote:
> Hi
>
> I've run into some performance issues with my Hadoop MapReduce job.
> Basically what I'm doing with it is:
>
> - read data from an HDFS file
> - the output also goes to HDFS files (multiple ones in my scenario)
> - in my mapper I process each line and enrich it with some data read
>   from an HBase table (I do a Get each time)
> - I don't use a reducer
>
> The Get performance seems not that good. On average it is ~17.5
> gets/second; peaks are 100 gets/sec (which would be the desirable
> speed :)). The logs are from one node only, and so is the performance
> count.
>
> My schema is nothing special - one ColumnFamily with 3 columns, but I
> heavily use timestamps.
> My table looks like this:
>
> {NAME => 'XYZ', FAMILIES => [{NAME => 'cf', BLOOMFILTER => 'NONE',
> REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS =>
> '2147483646', TTL => '2147483647', MIN_VERSIONS => '0',
> BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>
> Look at the number of VERSIONS.
>
> And my Gets are like this:
>
> Get get = new Get(Bytes.toBytes(key));
> get.setMaxVersions(1);
> get.setTimeRange(0, timestamp);
> get.setCacheBlocks(false);
> get.addFamily(Bytes.toBytes("cf"));
> Result res = htable.get(get);
>
> I init that HTable like this:
>
> htable = new HTable(config, QUERY_TABLE_NAME);
> htable.setAutoFlush(false);
> htable.setWriteBufferSize(1024 * 1024 * 12);
>
> I've attached a sample of the Get performance - the first column is the
> number of Gets, the second is a date.
>
> Could you suggest where I'm getting that performance penalty? What
> should I look at to check that I'm not doing something stupid here,
> what kind of statistics?
>
> Regards,
> Marcin
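P.S. To make the "cache them in your mappers" suggestion concrete: if the same keys show up repeatedly in your input, a small LRU cache held as a field in the mapper lets you skip repeat Gets entirely. Here's a minimal sketch using plain LinkedHashMap (the class name RowCache and the idea of caching the raw value bytes are mine, not anything from your job; pick a capacity that fits your heap):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache for rows already fetched from HBase. Keep one
// instance as a field in the mapper; look up the key here before
// issuing a Get, and put the fetched value in after a miss.
class RowCache extends LinkedHashMap<String, byte[]> {
    private final int capacity;

    RowCache(int capacity) {
        // accessOrder = true makes iteration order = least recently
        // accessed first, which is what gives us LRU eviction below.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        // Evict the least recently used entry once we exceed capacity.
        return size() > capacity;
    }
}
```

How much this buys you depends entirely on how much key reuse your input actually has; if every line carries a distinct key, the cache does nothing and batching is the better lever.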

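P.P.S. A rough sketch of the batching step behind the multi-get idea: collect your keys, split them into fixed-size chunks, and turn each chunk into a List<Get> that you hand to a single htable.get(batch) call instead of one RPC per key. The helper below only shows the generic chunking (the class name Batcher and the batch size of 100 are my inventions; the HBase part is sketched in the comment):

```java
import java.util.ArrayList;
import java.util.List;

class Batcher {
    // Split items into consecutive batches of at most `size` elements.
    // With keys as input, each batch becomes one List<Get> and one
    // htable.get(batch) call, i.e. roughly one round trip per region
    // server per batch instead of one per key. A size around 100 is a
    // guess; tune it against your region server memory and row size.
    static <T> List<List<T>> batches(List<T> items, int size) {
        List<List<T>> out = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            out.add(new ArrayList<T>(items.subList(i, end)));
        }
        return out;
    }
}
```

If you go through a reducer to do this, remember you're also paying for the shuffle; whether that beats per-line Gets depends on how many distinct keys you have per task.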