I think TableInputFormat will try to maintain as much locality as possible,
assigning one Spark partition per region and trying to schedule that
partition on a YARN container/executor on the same node as the region
(assuming you're running Spark on YARN). So the reason for the uneven
distribution could be that your HBase cluster is not balanced to begin with
and has too many regions on the same region server, which would correspond
to your largest bar. It all depends on which HBase balancer you have
configured and how it is tuned. Assuming that is properly configured, try
to balance your HBase cluster before running the Spark job. There are
commands in the hbase shell to do it manually if required.
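
If you want to confirm the skew from the Spark side first, a rough,
untested sketch like the one below (Spark 2.x Java API, reusing the
hBaseRDD from your snippet) counts the rows in each partition. Since
TableInputFormat gives you one partition per region, a very lopsided
result points back at the region distribution rather than at Spark:

    // Untested sketch: count the rows in each partition of the existing
    // hBaseRDD. One partition corresponds to one HBase region, so a very
    // uneven list here means the regions themselves are uneven in size or
    // are badly placed across region servers.
    import java.util.Collections;   // at the top of the class
    import java.util.List;

    List<Integer> rowsPerPartition = hBaseRDD
            .mapPartitions(it -> {
                int n = 0;
                while (it.hasNext()) { it.next(); n++; }
                return Collections.singletonList(n).iterator();
            })
            .collect();
    System.out.println("Rows per partition: " + rowsPerPartition);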

Hope this helps.

----
Saad


On Sat, May 19, 2018 at 6:40 PM, Alchemist <alchemistsrivast...@gmail.com>
wrote:

> I am trying to parallelize a simple Spark program that processes HBase
> data.
>
> // Get HBase RDD
>     JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
>             .newAPIHadoopRDD(conf, TableInputFormat.class,
>                     ImmutableBytesWritable.class, Result.class);
>     long count = hBaseRDD.count();
>
> The only two lines I see in the logs are that ZooKeeper starts and ZooKeeper stops.
>
>
> The problem is that my program is as SLOW as the largest bar. I found that
> ZK is taking a long time before shutting down.
> 18/05/19 17:26:55 INFO zookeeper.ClientCnxn: Session establishment complete on server :2181, sessionid = 0x163662b64eb046d, negotiated timeout = 40000
> 18/05/19 17:38:00 INFO zookeeper.ZooKeeper: Session: 0x163662b64eb046d closed
