Thanks Ted. Do you see a good approach for calculating the size of an HBase table and passing it to Spark, so that Spark can decide which type of join to perform? As far as I have understood, from HBaseAdmin we get the list of regions, and for each region we compute the HDFS blocks distribution of its HFiles and sum their weights [0]. I have sketched what I have in mind below.

[0]: https://github.com/apache/hbase/blob/rel/1.1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/HDFSBlocksDistribution.java
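Something along these lines is what I have in mind (an untested sketch against the HBase 1.1 client API; HBaseTableSize and estimateTableSizeBytes are just placeholder names):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.regionserver.HRegion
    import scala.collection.JavaConverters._

    object HBaseTableSize {
      // Approximate the on-disk size of a table by summing the HDFS block
      // weights of every region's HFiles.
      def estimateTableSizeBytes(tableNameStr: String): Long = {
        val conf = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(conf)
        try {
          val tableName  = TableName.valueOf(tableNameStr)
          val admin      = connection.getAdmin
          val descriptor = admin.getTableDescriptor(tableName)
          val regions    = admin.getTableRegions(tableName).asScala

          // computeHDFSBlocksDistribution walks the region's HFiles; the
          // unique blocks total weight is the number of bytes they occupy.
          regions.map { regionInfo =>
            HRegion
              .computeHDFSBlocksDistribution(conf, descriptor, regionInfo)
              .getUniqueBlocksTotalWeight
          }.sum
        } finally {
          connection.close()
        }
      }
    }

Does summing getUniqueBlocksTotalWeight per region look like the right way to read HDFSBlocksDistribution for this purpose?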
On Fri, Jul 22, 2016 at 4:24 AM, Ted Yu <[email protected]> wrote:

> Please take a look at the following methods:
>
> From HBaseAdmin:
>
> public List<HRegionInfo> getTableRegions(final TableName tableName)
>
> From HRegion:
>
> public static HDFSBlocksDistribution computeHDFSBlocksDistribution(final Configuration conf,
>     final HTableDescriptor tableDescriptor, final HRegionInfo regionInfo) throws IOException {
>
> FYI
>
> On Wed, Jul 20, 2016 at 11:28 PM, Sachin Jain <[email protected]> wrote:
>
> > *Context*
> > I am using Spark (1.5.1) with HBase (1.1.2) to dump the output of Spark jobs into HBase, which is then available for lookups from the HBase table. Our BaseRelation extends HadoopFsRelation and is used to read from and write to HBase, via the Spark Data Source API.
> >
> > *Use Case*
> > Whenever I perform a join operation, Spark creates a logical plan and decides which type of join to execute. As per SparkStrategies [0], it checks the size of the HBase table: if it is less than some threshold (10 MB), it selects a broadcast hash join, otherwise a sort merge join.
> >
> > *Problem Statement*
> > I want to know if there is an API or some approach to calculate the size of an HBase table.
> >
> > [0]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L118
> >
> > Thanks
> > -Sachin
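Coming back to the second half of my question (telling Spark about the size): my current thinking is to report the estimate through BaseRelation.sizeInBytes, which SparkStrategies compares against spark.sql.autoBroadcastJoinThreshold. A minimal sketch, assuming the size has already been computed as above; HBaseLookupRelation and estimatedSizeBytes are hypothetical names, and our real relation extends HadoopFsRelation with the scan traits rather than plain BaseRelation:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.sources.BaseRelation
    import org.apache.spark.sql.types.StructType

    class HBaseLookupRelation(
        val sqlContext: SQLContext,
        val schema: StructType,
        estimatedSizeBytes: Long  // e.g. the value from the helper above
      ) extends BaseRelation {

      // Report the backing HBase table's size to the optimizer. If this is
      // below spark.sql.autoBroadcastJoinThreshold (10 MB by default), the
      // planner can choose a broadcast hash join for this side of the join.
      override def sizeInBytes: Long = estimatedSizeBytes
    }

Does overriding sizeInBytes this way sound like the right hook, or is there a better place to plug the estimate in?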
