Task tracker timeout with filtered table scan

Bryan Keller Thu, 31 May 2012 09:27:58 -0700

I have a large table that I am running a map reduce job on. The job scans for a 
particular column value in the table using a TableInputFormat with a filter on 
the scan. This value only matches a few rows, so most of the rows are filtered 
out.


The problem is that the TableInputFormat will not report status back to the 
task tracker until the regionserver sends back a row matching the filter. If 
there are only a few matching rows and the table is very large, it can take a 
long time for a row to come back from the regionserver. This can result in a 
task tracker timeout. The problem is exacerbated by large region file sizes.

I can sort of work around this by increasing the mapred.task.timeout property, 
but that doesn't seem like a good fix. The other solution would be to drop the 
filter and filter out rows in the map reduce job instead, which would increase 
I/O. Any other solutions? It seems the TableInputFormat shouldn't have to wait 
for a row from the regionserver before reporting status back to the task 
tracker.
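
One direction that might be worth exploring is to report progress from a 
separate thread while the scan is blocked. The sketch below uses only the Java 
standard library to show the pattern. ProgressReporter, callWithHeartbeat and 
the intervals are hypothetical names and values chosen for illustration, and 
are not part of HBase or Hadoop. In a real job the reporter would presumably 
wrap whatever progress callback the task context exposes, and whether that 
callback may safely be called from another thread is an assumption that would 
need checking against the Hadoop version in use.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class HeartbeatSketch {

    /** Hypothetical stand-in for the task's progress callback; not a Hadoop type. */
    interface ProgressReporter {
        void progress();
    }

    /**
     * Runs a blocking call while a daemon thread invokes the reporter at a
     * fixed interval, then stops the heartbeat once the call returns or throws.
     */
    static <T> T callWithHeartbeat(Callable<T> blockingCall,
                                   ProgressReporter reporter,
                                   long intervalMillis) throws Exception {
        ScheduledExecutorService heartbeat =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "heartbeat");
                t.setDaemon(true); // never keep the JVM alive just for heartbeats
                return t;
            });
        heartbeat.scheduleAtFixedRate(reporter::progress,
                intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        try {
            return blockingCall.call();
        } finally {
            heartbeat.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger heartbeats = new AtomicInteger();
        String row = callWithHeartbeat(() -> {
            Thread.sleep(300); // simulates a long filtered scan with no matches yet
            return "matching-row";
        }, heartbeats::incrementAndGet, 50);
        System.out.println(row + " after " + heartbeats.get() + " heartbeats");
    }
}
```

The trade-off is that a heartbeat like this would also keep a genuinely hung 
scan alive, so it effectively disables the timeout for the wrapped call.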
