I hope someone can tell me what the difference between these two API calls
is.  I'm getting inconsistent results between them.  This happens with
hbase-client/hbase-server versions 1.0.1 and 1.2.0-cdh5.7.2.

First off, my rowkeys are in the format hash_name_timestamp,
e.g. 100_servername_1234567890.  The HBase table has a TTL of 30 days, so
rows older than 30 days should disappear after compaction.
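
In case it helps, a rowkey in this scheme is built roughly like this (a
sketch only; the salt formula and epoch-seconds timestamp are assumptions
based on the example above):

    // Sketch of rowkey construction; the salt formula and epoch-seconds
    // timestamp are assumptions based on 100_servername_1234567890.
    String rowkey = String.format("%d_%s_%d",
        Math.abs(serverName.hashCode()) % 1000,  // hash/salt prefix, e.g. 100
        serverName,                              // e.g. servername
        timestampMillis / 1000L);                // epoch seconds, e.g. 1234567890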

The following code uses a plain ResultScanner.  It doesn't use MapReduce, so
it takes far too long to complete; I can't run my real job this way.
However, for debugging purposes, this method gives me no problems.  It lists
all keys for the specified time range, and the results look valid to me:
every returned key has a timestamp within the past 30 days and within the
specified time range:

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("raw_data"), Bytes.toBytes(fileType));
    scan.setCaching(500);        // fetch 500 rows per RPC to the region server
    scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan
    scan.setTimeRange(start, end);

    // try-with-resources so the connection, table, and scanner get closed
    try (Connection fConnection = ConnectionFactory.createConnection(conf);
         Table table = fConnection.getTable(TableName.valueOf(tableName));
         ResultScanner scanner = table.getScanner(scan)) {
        for (Result result = scanner.next(); result != null; result = scanner.next()) {
            System.out.println("Found row: " + Bytes.toString(result.getRow()));
        }
    }


The following code doesn't work correctly, but it uses MapReduce, which runs
much faster than the ResultScanner approach since it divides the work into
1200 map tasks.  The problem is that it returns rowkeys that should have
disappeared due to TTL expiry:

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("raw_data"), Bytes.toBytes(fileType));
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    scan.setTimeRange(start, end);

    TableMapReduceUtil.initTableMapperJob(tableName, scan, MTTRMapper.class,
        Text.class, IntWritable.class, job);
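
For completeness, the job and conf variables in that snippet come from
standard MapReduce driver setup along these lines (a sketch, not my exact
driver; the job name is a placeholder):

    Configuration conf = HBaseConfiguration.create();  // picks up hbase-site.xml
    Job job = Job.getInstance(conf, "scan-job");       // "scan-job" is a placeholder
    job.setJarByClass(MTTRMapper.class);               // ship the jar with the mapper
    // ...initTableMapperJob(...) as above, plus mapper output config...
    job.waitForCompletion(true);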

Here is the error I get; it eventually kills the whole MR job because over
25% of the mappers fail:

> Error: org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Failed after attempts=36, exceptions: Wed Jun 28 13:46:57 PDT 2017,
> null, java.net.SocketTimeoutException: callTimeout=120000,
> callDuration=120301: row '65_app129041.iad1.mydomain.com_1476641940'
> on table 'server_based_data' at region=server_based_data

I'll try to study the code for the hbase-client and hbase-server jars, but
hopefully someone will know offhand what the difference between the two
methods is and what is causing the initTableMapperJob call to fail.
