Agreed that the bulk import would be faster. In my case, I wasn't expecting a lot of data to be uploaded to HBase, and I also didn't want to take the pain of importing the generated HFiles into HBase. Is there a way to invoke the HBase HFile import batch script programmatically?
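For reference, the class behind HBase's completebulkload tool, LoadIncrementalHFiles, can be invoked directly from code, which avoids shelling out to the batch script. A minimal sketch, assuming HBase 0.98-era APIs; the table name and HFile directory are illustrative:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table")  // illustrative table name
val loader = new LoadIncrementalHFiles(conf)

// Moves the HFiles under the given directory into the table's regions;
// this is what the completebulkload command-line tool does.
loader.doBulkLoad(new Path("/tmp/hfiles"), table)  // illustrative path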
On 19 September 2014 17:58, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:

> In fact, it seems that Put can be used by HFileOutputFormat, so the Put
> object itself may not be the problem.
>
> The problem is that TableOutputFormat uses the Put object in the normal
> way (which goes through the normal write path), while HFileOutputFormat
> uses it to directly build the HFiles.
>
> *From:* innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> *Sent:* Friday, September 19, 2014 9:20 PM
> *To:* user@spark.apache.org
> *Subject:* RE: Bulk-load to HBase
>
> Thank you for the example code.
>
> Currently I use foreachPartition() + Put(), but your example code can be
> used to clean up my code.
>
> BTW, since the data uploaded by Put() goes through the normal HBase write
> path, it can be slow. So it would be nice if bulk-load could be used,
> since it bypasses the write path.
>
> Thanks.
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
> *Sent:* Friday, September 19, 2014 9:01 PM
> *To:* innowireless TaeYun Kim
> *Cc:* user
> *Subject:* Re: Bulk-load to HBase
>
> I have been using saveAsNewAPIHadoopDataset, but I use TableOutputFormat
> instead of HFileOutputFormat. Hopefully this should help you:
>
> val hbaseZookeeperQuorum =
>   s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
> val conf = HBaseConfiguration.create()
> conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
> conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)
> conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)
> conf.setClass("mapreduce.outputformat.class",
>   classOf[TableOutputFormat[Object]],
>   classOf[OutputFormat[Object, Writable]])
> conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
>
> // Some RDD that contains row key, column qualifier and data
> val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ...
>
> val putRDD = rddToSave.map(tuple => {
>   val (rowKey, column, data) = tuple
>   val put: Put = new Put(rowKey)
>   put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)
>   (new ImmutableBytesWritable(rowKey), put)
> })
>
> putRDD.saveAsNewAPIHadoopDataset(conf)
>
> On 19 September 2014 16:52, innowireless TaeYun Kim
> <taeyun....@innowireless.co.kr> wrote:
>
> Hi,
>
> Sorry, I just found saveAsNewAPIHadoopDataset.
> Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is
> there any example code for that?
>
> Thanks.
>
> *From:* innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> *Sent:* Friday, September 19, 2014 8:18 PM
> *To:* user@spark.apache.org
> *Subject:* RE: Bulk-load to HBase
>
> Hi,
>
> After reading several documents, it seems that saveAsHadoopDataset cannot
> use HFileOutputFormat. It's because the saveAsHadoopDataset method uses
> JobConf, so it belongs to the old Hadoop API, while HFileOutputFormat is
> a member of the mapreduce package, which is for the new Hadoop API.
>
> Am I right?
> If so, is there another method to bulk-load to HBase from an RDD?
>
> Thanks.
>
> *From:* innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> *Sent:* Friday, September 19, 2014 7:17 PM
> *To:* user@spark.apache.org
> *Subject:* Bulk-load to HBase
>
> Hi,
>
> Is there a way to bulk-load to HBase from an RDD?
> HBase offers the HFileOutputFormat class for bulk loading by a MapReduce
> job, but I cannot figure out how to use it with saveAsHadoopDataset.
>
> Thanks.
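To close the loop on the thread's question: HFileOutputFormat is a new-API OutputFormat, so it can be driven from saveAsNewAPIHadoopFile instead of saveAsHadoopDataset. The sketch below is untested; it reuses rddToSave, tableName and COLUMN_FAMILY_RAW_DATA_BYTES from the example above, the staging directory is illustrative, it assumes HBase 0.98-era APIs, and it assumes at most one qualifier per row key (otherwise the KeyValues within a row must also be sorted by qualifier):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat, LoadIncrementalHFiles}
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
val table = new HTable(conf, tableName)

// HFileOutputFormat requires its input to be sorted by row key.
implicit val rowKeyOrdering: Ordering[ImmutableBytesWritable] =
  new Ordering[ImmutableBytesWritable] {
    def compare(a: ImmutableBytesWritable, b: ImmutableBytesWritable): Int =
      a.compareTo(b)
  }

val kvRDD = rddToSave
  .map { case (rowKey, column, data) =>
    (new ImmutableBytesWritable(rowKey),
     new KeyValue(rowKey, COLUMN_FAMILY_RAW_DATA_BYTES, column, data))
  }
  .sortByKey()

// configureIncrementalLoad copies the table's compression and block-encoding
// settings into the job configuration (its MapReduce partitioner setup is
// unused under Spark).
val job = Job.getInstance(conf)
HFileOutputFormat.configureIncrementalLoad(job, table)

kvRDD.saveAsNewAPIHadoopFile(
  "/tmp/hfiles",  // illustrative staging directory
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat],
  job.getConfiguration)

// Hand the generated HFiles to the bulk loader, bypassing the write path.
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table)

sortByKey gives a total order across partitions, so each task writes sorted HFiles; during the load, LoadIncrementalHFiles splits any HFile that spans a region boundary.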