Agreed that the bulk import would be faster. In my case, I wasn't expecting a lot of data to be uploaded to HBase, and I also didn't want to take the pain of importing the generated HFiles into HBase. Is there a way to invoke the HBase HFile import batch script programmatically?
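For reference, the class behind HBase's completebulkload tool, LoadIncrementalHFiles, can be invoked directly from code, which avoids shelling out to the batch script. A minimal sketch, assuming HBase 0.98-era APIs; the table name and HFile directory are illustrative:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table")  // illustrative table name
val loader = new LoadIncrementalHFiles(conf)

// Moves the HFiles under the given directory into the table's regions;
// this is what the completebulkload command-line tool does.
loader.doBulkLoad(new Path("/tmp/hfiles"), table)  // illustrative path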
On 19 September 2014 17:58, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:

> In fact, it seems that Put can be used by HFileOutputFormat, so the Put
> object itself may not be the problem.
>
> The problem is that TableOutputFormat uses the Put object in the normal
> way (which goes through the normal write path), while HFileOutputFormat
> uses it to directly build the HFiles.
>
> *From:* innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> *Sent:* Friday, September 19, 2014 9:20 PM
> *To:* user@spark.apache.org
> *Subject:* RE: Bulk-load to HBase
>
> Thank you for the example code.
>
> Currently I use foreachPartition() + Put(), but your example code can be
> used to clean up my code.
>
> BTW, since the data uploaded by Put() goes through the normal HBase write
> path, it can be slow. So it would be nice if bulk-load could be used,
> since it bypasses the write path.
>
> Thanks.
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
> *Sent:* Friday, September 19, 2014 9:01 PM
> *To:* innowireless TaeYun Kim
> *Cc:* user
> *Subject:* Re: Bulk-load to HBase
>
> I have been using saveAsNewAPIHadoopDataset, but I use TableOutputFormat
> instead of HFileOutputFormat. Hopefully this should help you:
>
> val hbaseZookeeperQuorum =
>   s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
> val conf = HBaseConfiguration.create()
> conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
> conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)
> conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)
> conf.setClass("mapreduce.outputformat.class",
>   classOf[TableOutputFormat[Object]],
>   classOf[OutputFormat[Object, Writable]])
> conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
>
> // Some RDD that contains row key, column qualifier and data
> val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ...
>
> val putRDD = rddToSave.map(tuple => {
>   val (rowKey, column, data) = tuple
>   val put: Put = new Put(rowKey)
>   put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)
>   (new ImmutableBytesWritable(rowKey), put)
> })
>
> putRDD.saveAsNewAPIHadoopDataset(conf)
>
> On 19 September 2014 16:52, innowireless TaeYun Kim
> <taeyun....@innowireless.co.kr> wrote:
>
> Hi,
>
> Sorry, I just found saveAsNewAPIHadoopDataset.
> Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is
> there any example code for that?
>
> Thanks.
>
> *From:* innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> *Sent:* Friday, September 19, 2014 8:18 PM
> *To:* user@spark.apache.org
> *Subject:* RE: Bulk-load to HBase
>
> Hi,
>
> After reading several documents, it seems that saveAsHadoopDataset cannot
> use HFileOutputFormat. It's because the saveAsHadoopDataset method uses
> JobConf, so it belongs to the old Hadoop API, while HFileOutputFormat is
> a member of the mapreduce package, which is for the new Hadoop API.
>
> Am I right?
> If so, is there another method to bulk-load to HBase from an RDD?
>
> Thanks.
>
> *From:* innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> *Sent:* Friday, September 19, 2014 7:17 PM
> *To:* user@spark.apache.org
> *Subject:* Bulk-load to HBase
>
> Hi,
>
> Is there a way to bulk-load to HBase from an RDD?
> HBase offers the HFileOutputFormat class for bulk loading by a MapReduce
> job, but I cannot figure out how to use it with saveAsHadoopDataset.
>
> Thanks.
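To close the loop on the thread's question: HFileOutputFormat is a new-API OutputFormat, so it can be driven from saveAsNewAPIHadoopFile instead of saveAsHadoopDataset. The sketch below is untested; it reuses rddToSave, tableName and COLUMN_FAMILY_RAW_DATA_BYTES from the example above, the staging directory is illustrative, it assumes HBase 0.98-era APIs, and it assumes at most one qualifier per row key (otherwise the KeyValues within a row must also be sorted by qualifier):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat, LoadIncrementalHFiles}
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
val table = new HTable(conf, tableName)

// HFileOutputFormat requires its input to be sorted by row key.
implicit val rowKeyOrdering: Ordering[ImmutableBytesWritable] =
  new Ordering[ImmutableBytesWritable] {
    def compare(a: ImmutableBytesWritable, b: ImmutableBytesWritable): Int =
      a.compareTo(b)
  }

val kvRDD = rddToSave
  .map { case (rowKey, column, data) =>
    (new ImmutableBytesWritable(rowKey),
     new KeyValue(rowKey, COLUMN_FAMILY_RAW_DATA_BYTES, column, data))
  }
  .sortByKey()

// configureIncrementalLoad copies the table's compression and block-encoding
// settings into the job configuration (its MapReduce partitioner setup is
// unused under Spark).
val job = Job.getInstance(conf)
HFileOutputFormat.configureIncrementalLoad(job, table)

kvRDD.saveAsNewAPIHadoopFile(
  "/tmp/hfiles",  // illustrative staging directory
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat],
  job.getConfiguration)

// Hand the generated HFiles to the bulk loader, bypassing the write path.
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table)

sortByKey gives a total order across partitions, so each task writes sorted HFiles; during the load, LoadIncrementalHFiles splits any HFile that spans a region boundary.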