In that case I suggest asking on user@hive to see if someone has done this.

Thanks

On Mon, Oct 10, 2016 at 2:56 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks. I am on Spark 2, so that may not be feasible.
>
> As a matter of interest, how about using Hive on top of an Hbase table?
>
> Dr Mich Talebzadeh
>
> On 10 October 2016 at 22:49, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > In the hbase master branch, there is an hbase-spark module which allows
> > you to integrate with Spark seamlessly.
> >
> > Note: support for Spark 2.0 is pending. For details, see HBASE-16179
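> >
> > A rough sketch of what the read path looks like there (untested, and
> > against the module as it currently stands in master, so the API may still
> > change):
> >
> > import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
> > import org.apache.hadoop.hbase.client.Scan
> > import org.apache.hadoop.hbase.spark.HBaseContext
> >
> > // wrap the SparkContext and the HBase config in an HBaseContext
> > val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
> > // scan the table and get back an RDD of (rowkey, Result) pairs
> > val rdd = hbaseContext.hbaseRDD(TableName.valueOf("marketDataHbase"), new Scan())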
> >
> > Cheers
> >
> > On Mon, Oct 10, 2016 at 2:46 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >
> > > Thanks Ted,
> > >
> > > So basically it involves Java programming, much like JDBC connection
> > > retrieval etc.
> > >
> > > Writing to Hbase is pretty fast. Now I have views in both Phoenix and
> > > Hive on the underlying Hbase tables.
> > >
> > > I am looking for flexibility here, so I guess I should use Spark on
> > > Hive tables with a view on the Hbase table.
> > >
> > > Also I like tools like Zeppelin that work with both SQL and Spark
> > > functional programming.
> > >
> > > Sounds like reading data from the Hbase table is best done through some
> > > form of SQL.
> > >
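> > > For instance, something like this (untested; assuming my Hive view over
> > > the Hbase table is called marketdatahbase_view and maps the columns as
> > > rowkey, ticker, timecreated and price):
> > >
> > > val df = spark.sql("SELECT rowkey, ticker, timecreated, price FROM marketdatahbase_view")
> > > df.show(5)
> > >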
> > > What are your views on this approach?
> > >
> > >
> > >
> > > Dr Mich Talebzadeh
> > >
> > > On 10 October 2016 at 22:13, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > For org.apache.hadoop.hbase.client.Result, there is this method:
> > > >
> > > >   public byte[] getValue(byte [] family, byte [] qualifier)
> > > >
> > > > which allows you to retrieve the value for a designated column.
> > > >
> > > >
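> > > > For example, something along these lines (untested) should give you
> > > > all three columns from the price_info family shown in your scan
> > > > output:
> > > >
> > > > import org.apache.hadoop.hbase.util.Bytes
> > > >
> > > > val cf = Bytes.toBytes("price_info")
> > > > // note: getValue returns null for a missing cell, and
> > > > // Bytes.toString passes the null through
> > > > val allColumnsRDD = resultRDD.map(r => (
> > > >   Bytes.toString(r.getRow),                                 // rowkey
> > > >   Bytes.toString(r.getValue(cf, Bytes.toBytes("ticker"))),
> > > >   Bytes.toString(r.getValue(cf, Bytes.toBytes("timecreated"))),
> > > >   Bytes.toString(r.getValue(cf, Bytes.toBytes("price")))
> > > > ))
> > > >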
> > > > FYI
> > > >
> > > > On Mon, Oct 10, 2016 at 2:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am trying to do some operations on an Hbase table that is being
> > > > > populated by Spark Streaming.
> > > > >
> > > > > Now this is just Spark on Hbase, as opposed to Spark on Hive -> view
> > > > > on Hbase etc. I also have a Phoenix view on this Hbase table.
> > > > >
> > > > > This is the sample code:
> > > > >
> > > > > scala> val tableName = "marketDataHbase"
> > > > > scala> val conf = HBaseConfiguration.create()
> > > > > conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml
> > > > > scala> conf.set(TableInputFormat.INPUT_TABLE, tableName)
> > > > > scala> // create rdd
> > > > > scala> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
> > > > > hBaseRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[4] at newAPIHadoopRDD at <console>:64
> > > > > scala> hBaseRDD.count
> > > > > res11: Long = 22272
> > > > >
> > > > > scala> // transform (ImmutableBytesWritable, Result) tuples into an RDD of Results
> > > > > scala> val resultRDD = hBaseRDD.map(tuple => tuple._2)
> > > > > resultRDD: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result] = MapPartitionsRDD[8] at map at <console>:41
> > > > >
> > > > > scala> // transform into an RDD of (RowKey, ColumnValue)s; the RowKey has the time removed
> > > > > scala> val keyValueRDD = resultRDD.map(result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toString(result.value)))
> > > > > keyValueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[9] at map at <console>:43
> > > > >
> > > > > scala> keyValueRDD.take(2).foreach(kv => println(kv))
> > > > > (000055e2-63f1-4def-b625-e73f0ac36271,43.89760813529593664528)
> > > > > (000151e9-ff27-493d-a5ca-288507d92f95,57.68882040742382868990)
> > > > >
> > > > > OK, above I am only getting the rowkey (the UUID) and the last
> > > > > attribute (price). However, I have the rowkey and 3 more columns in
> > > > > the Hbase table!
> > > > >
> > > > > scan 'marketDataHbase', "LIMIT" => 1
> > > > > ROW                                   COLUMN+CELL
> > > > >  000055e2-63f1-4def-b625-e73f0ac36271 column=price_info:price, timestamp=1476133232864, value=43.89760813529593664528
> > > > >  000055e2-63f1-4def-b625-e73f0ac36271 column=price_info:ticker, timestamp=1476133232864, value=S08
> > > > >  000055e2-63f1-4def-b625-e73f0ac36271 column=price_info:timecreated, timestamp=1476133232864, value=2016-10-10T17:12:22
> > > > > 1 row(s) in 0.0100 seconds
> > > > >
> > > > > So how can I get the other columns?
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > Dr Mich Talebzadeh
> > > > >
> > > > >
> > > >
> > >
> >
>
