I have a large table that I am running a map reduce job on. The job scans for a 
particular column value in the table using a TableInputFormat with a filter on 
the scan. This value only matches a few rows, so most of the rows are filtered 
out.

The problem is that the TableInputFormat will not report status back to the 
task tracker until the regionserver sends back a row matching the filter. If 
there are only a few matching rows and the table is very large, it can take a 
while for a row to come back from the regionserver. This can result in a task 
tracker timeout. The problem is exacerbated by large region file sizes.

I can sort of work around this by increasing the mapred.task.timeout property, 
but that doesn't seem ideal. The other solution would be to drop the filter 
from the scan and filter out rows in the map reduce job instead, which would 
increase I/O. Any other solutions? It seems the TableInputFormat shouldn't have 
to wait for the regionserver before it can report status to the task tracker.
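
To make the second workaround concrete, here is a rough sketch of filtering in
the map side while reporting progress every so many rows, so the task shows
signs of life even when nothing matches. This is plain Java and only models the
idea: ProgressReporter, filterRows and reportEvery are hypothetical names, not
HBase or Hadoop APIs. In a real job the callback would presumably be
Reporter.progress() (old mapred API) or context.progress() (new mapreduce API).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ClientSideFilterSketch {

    /** Hypothetical stand-in for the framework's progress callback. */
    interface ProgressReporter {
        void progress();
    }

    /**
     * Walks every row (no server-side filter), keeps the rows whose value
     * equals target, and pings the reporter once every reportEvery rows
     * regardless of whether anything matched. reportEvery must be > 0.
     */
    static List<String> filterRows(List<String> rows, String target,
                                   int reportEvery, ProgressReporter reporter) {
        List<String> matches = new ArrayList<>();
        int seen = 0;
        for (String row : rows) {
            seen++;
            if (row.equals(target)) {
                matches.add(row);
            }
            // Report liveness based on rows scanned, not rows matched.
            if (seen % reportEvery == 0) {
                reporter.progress();
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Toy data standing in for column values read from the table.
        List<String> rows = Arrays.asList("a", "b", "x", "c", "d", "e", "x");
        AtomicInteger pings = new AtomicInteger();
        List<String> matches = filterRows(rows, "x", 3, pings::incrementAndGet);
        System.out.println(matches + " pings=" + pings.get());
    }
}
```

The trade-off is the one already mentioned: every row travels to the mapper, so
I/O goes up, but the task never goes silent for longer than reportEvery rows.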
