bonnahu: How many regions does your table have? Are they evenly distributed?
Cheers

On Thu, Dec 4, 2014 at 3:34 PM, Jörn Franke <[email protected]> wrote:

> Hi,
>
> What is your cluster setup? How much memory do you have? How much space
> does one row consisting of only the 3 columns consume? Do you run other
> stuff in the background?
>
> Best regards
>
> On 04.12.2014 at 23:57, "bonnahu" <[email protected]> wrote:
>
>> I am trying to load a large HBase table into a Spark RDD to run a
>> Spark SQL query on the entity. For an entity with about 6 million rows,
>> it takes about 35 seconds to load it into the RDD. Is that expected? Is
>> there any way to shorten the loading process? I have been following tips
>> from http://hbase.apache.org/book/perf.reading.html to speed up the
>> process, e.g., scan.setCaching(cacheSize) and adding only the necessary
>> attributes/columns to the scan. I am just wondering if there are other
>> ways to improve the speed?
>>
>> Here is the code snippet:
>>
>> SparkConf sparkConf = new
>>     SparkConf().setMaster("spark://url").setAppName("SparkSQLTest");
>> JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>>
>> Configuration hbase_conf = HBaseConfiguration.create();
>> hbase_conf.set("hbase.zookeeper.quorum", "url");
>> hbase_conf.set("hbase.regionserver.port", "60020");
>> hbase_conf.set("hbase.master", "url");
>> hbase_conf.set(TableInputFormat.INPUT_TABLE, entityName);
>>
>> Scan scan = new Scan();
>> scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col1"));
>> scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col2"));
>> scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col3"));
>> scan.setCaching(this.cacheSize);
>> hbase_conf.set(TableInputFormat.SCAN, convertScanToString(scan));
>>
>> JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
>>     jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
>>         ImmutableBytesWritable.class, Result.class);
>> logger.info("count is " + hBaseRDD.cache().count());
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Loading-a-large-Hbase-table-into-SPARK-RDD-takes-quite-long-time-tp20396.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
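
The snippet in the thread calls a convertScanToString helper that is not shown. A minimal sketch of one common implementation for HBase 0.98-era APIs (the era of this thread) is below: TableInputFormat reads the scan from the job configuration as a Base64-encoded serialized Scan, so the helper serializes the Scan via protobuf and Base64-encodes it. The class name ScanUtil is hypothetical; the HBase classes used (ProtobufUtil, ClientProtos.Scan, Base64) are from the org.apache.hadoop.hbase packages of that era and this is an assumption about the poster's helper, not their actual code.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;

// Hypothetical helper class; a sketch of the unshown convertScanToString.
public class ScanUtil {
    // Serialize the Scan into the Base64 string form that
    // TableInputFormat.SCAN expects in the Hadoop Configuration.
    static String convertScanToString(Scan scan) throws IOException {
        ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
        return Base64.encodeBytes(proto.toByteArray());
    }
}
```

Later HBase releases expose an equivalent public TableMapReduceUtil.convertScanToString(Scan) in org.apache.hadoop.hbase.mapreduce, which can be used instead of hand-rolling the helper.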
