Are the regions for this table evenly spread across nodes in your cluster?
Were region servers under (heavy) load when your job ran?

Cheers

On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:

> I submitted a job in Yarn-Client mode, which simply reads from an HBase
> table containing tens of millions of records and then does a *count* action.
> The job ran for much longer than I expected, so I wonder whether that was
> because there was too much data to read. Actually, there are 20 nodes in my
> Hadoop cluster, so the HBase table does not seem that big (tens of millions
> of records).
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, when the job was running, I could see logs on the console, and
> specifically I'd like to know what the following log means:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
> Thanks
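For reference, here is a minimal sketch of the kind of scan-and-count job described above, assuming the table is read through TableInputFormat via newAPIHadoopRDD (the standard approach on Spark 0.9 / HBase 0.96); "my_table" and the app name are placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    object HBaseCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("yarn-client", "hbase-count")

        val conf = HBaseConfiguration.create()
        // Placeholder table name; substitute your own.
        conf.set(TableInputFormat.INPUT_TABLE, "my_table")

        // TableInputFormat typically produces one input split (and hence
        // one Spark task) per HBase region, which is why the log shows
        // per-task progress like "18/86".
        val rdd = sc.newAPIHadoopRDD(
          conf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println("row count: " + rdd.count())
      }
    }

With this setup the count can only go as fast as the slowest regions scan, which is why region balance and region-server load (the questions above) matter.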