I think TableInputFormat tries to maintain as much locality as possible, assigning one Spark partition per region and trying to place that partition in a YARN container/executor on the same node (assuming you're running Spark on YARN). So the reason for the uneven distribution could be that your HBase cluster is not balanced to begin with and has too many regions on the same region server, corresponding to your largest bar. It all depends on which HBase balancer you have configured and how it is tuned. Assuming that is properly configured, try balancing your HBase cluster before running the Spark job. There are commands in the hbase shell to do it manually if required.
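For example, something like the following in the hbase shell (command names as in recent HBase releases; run against your own cluster, and note that the balancer only acts if regions are actually unevenly distributed):

```shell
# Show region counts per region server to confirm the skew
hbase shell <<'EOF'
status 'simple'

# Make sure the balancer is enabled, then trigger a balancing run.
# 'balancer' returns true if a run was started.
balance_switch true
balancer
EOF
```

After the balancer finishes moving regions, re-run the Spark job and check whether the partition sizes (and task durations) even out.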
Hope this helps.

----
Saad

On Sat, May 19, 2018 at 6:40 PM, Alchemist <alchemistsrivast...@gmail.com> wrote:
> I am trying to parallelize a simple Spark program that processes HBase data
> in parallel.
>
> // Get HBase RDD
> JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
>     .newAPIHadoopRDD(conf, TableInputFormat.class,
>         ImmutableBytesWritable.class, Result.class);
> long count = hBaseRDD.count();
>
> Only two lines do I see in the logs: ZooKeeper starts and ZooKeeper stops.
>
> The problem is my program is as SLOW as the largest bar. Found that ZK is
> taking a long time before shutting down.
>
> 18/05/19 17:26:55 INFO zookeeper.ClientCnxn: Session establishment complete
> on server :2181, sessionid = 0x163662b64eb046d, negotiated timeout = 40000
> 18/05/19 17:38:00 INFO zookeeper.ZooKeeper: Session: 0x163662b64eb046d closed
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org