Careful: HBase with Phoenix is faster only in certain scenarios, namely when you process a small slice out of a much larger data set (the exact threshold depends on node memory, the operation, etc.). Hive+Tez+ORC can be quite competitive, and LLAP makes sense for interactive ad-hoc queries that are broadly similar to each other. Phoenix and Hive serve different purposes, with different architectures and underlying data structures.
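To make the contrast concrete, a rough sketch of the two query shapes follows. The table, column family and host names are taken from the thread below; the Phoenix view definition, its pk column and the "test" Hive database are assumptions made for illustration.

// Sketch only: a Phoenix point lookup vs a Hive/ORC scan-and-aggregate.
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object TwoAccessPatterns {
  def main(args: Array[String]): Unit = {
    // 1) Selective lookup of one row out of a large table: the Phoenix-on-HBase sweet spot.
    //    Needs the Phoenix client jar on the classpath; view and pk names are assumptions.
    val conn = DriverManager.getConnection("jdbc:phoenix:rhes564:2181")
    val ps = conn.prepareStatement(
      """SELECT "price_info"."ticker", "price_info"."price" FROM "marketDataHbase" WHERE pk = ?""")
    ps.setString(1, "a2c844ed-137f-4820-aa6e-c49739e46fa6")
    val rs = ps.executeQuery()
    while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getString(2)}")
    conn.close()

    // 2) Scan-and-aggregate over a whole day's partition: better served by Hive on ORC,
    //    and by LLAP once similar ad-hoc queries are run interactively and repeatedly.
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sql(
      """SELECT security, avg(price) AS avg_price
        |FROM test.marketData
        |WHERE datestamp = '2016-10-18'
        |GROUP BY security""".stripMargin).show()
  }
}

The first query touches one row via the key; the second has to read a whole day's partition, which is where ORC plus Tez (and LLAP caching for repeated queries) pays off.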
> On 18 Oct 2016, at 07:44, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> yes, the Hive external table is partitioned on a daily basis (datestamp below)
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData (
>      KEY string
>    , SECURITY string
>    , TIMECREATED string
>    , PRICE float
> )
> COMMENT 'From prices Kafka delivered by Flume location by day'
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION 'hdfs://rhes564:9000/data/prices/'
> --TBLPROPERTIES ("skip.header.line.count"="1")
> ;
>
> ALTER TABLE ${DATABASE}.externalMarketData SET LOCATION
> 'hdfs://rhes564:9000/data/prices/${TODAY}';
>
> and there is an insert/overwrite into the managed table every 15 minutes:
>
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
> SELECT
>      KEY
>    , SECURITY
>    , TIMECREATED
>    , PRICE
>    , 1
>    , CAST(from_unixtime(unix_timestamp()) AS timestamp)
> FROM ${DATABASE}.externalMarketData
>
> That works fine. However, Hbase is much faster for data retrieval with Phoenix.
>
> When we get Hive with LLAP, I gather Hive will replace Hbase.
>
> So in summary we have:
>
> raw data delivered to HDFS
> data ingested into Hbase via cron
> the HDFS directory mapped to a Hive external table
> a Hive managed table with added optimisation/indexing (ORC)
>
> There are a number of ways of doing it, as usual.
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>> On 18 October 2016 at 00:48, ayan guha <guha.a...@gmail.com> wrote:
>> I do not see a rationale to have hbase in this scheme of things... maybe I
>> am missing something?
>>
>> If data is delivered in HDFS, why not just add a partition to an existing
>> Hive table?
>>
>>> On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Thanks Mike,
>>>
>>> My test csv data comes as:
>>>
>>> UUID, ticker, timecreated, price
>>> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09, 53.36665625650533484995
>>> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09, 86.31917515824627016510
>>> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09, 95.48298277703729129559
>>>
>>> And this is my Hbase table with one column family:
>>>
>>> create 'marketDataHbase', 'price_info'
>>>
>>> It is populated every 15 minutes from test.csv files delivered via Kafka
>>> and Flume to HDFS:
>>>
>>> 1) create a fat csv file based on all the small csv files for today --> prices/2016-10-17
>>> 2) populate the data into the Hbase table using
>>>    org.apache.hadoop.hbase.mapreduce.ImportTsv, which is pretty quick using MapReduce
>>>
>>> That ImportTsv only appends new rows to the Hbase table, as the choice of
>>> UUID as rowKey avoids any issues.
>>>
>>> So I only have a 15 minute lag in my batch Hbase table.
>>>
>>> I have both Hive ORC tables and Phoenix views on top of this Hbase table.
>>>
>>> 1) Phoenix offers the fastest response if used on top of Hbase. Unfortunately,
>>>    Spark 2 with Phoenix is broken.
>>> 2) Spark on Hive on Hbase looks OK. This works fine with Spark 2.
>>> 3) Spark on Hbase tables directly, using key, value DFs for each column. Not as
>>>    fast as 2) but works. I don't think a DF is a good choice for a key, value pair?
>>>
>>> Now if I use Zeppelin to read from Hbase:
>>>
>>> I can use Phoenix JDBC. That looks very fast.
>>> I can use Spark csv directly on the HDFS csv files.
>>> I can use Spark on the Hive tables.
>>>
>>> If I use Tableau on the Hbase data, then only SQL-like access is useful:
>>> Phoenix or Hive.
>>>
>>> I don't want to change the design now. But admittedly Hive is the best SQL
>>> on top of Hbase. The next release of Hive is going to have an in-memory
>>> caching layer (LLAP) so we can cache Hive tables in memory. That will be
>>> faster. Many people underestimate Hive, but I still believe it has a lot to
>>> offer besides serious ANSI-compliant SQL.
>>>
>>> Regards
>>>
>>> Mich
>>>
>>> Dr Mich Talebzadeh
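On the Zeppelin/Phoenix point above: while the phoenix-spark integration was reported broken with Spark 2, the Phoenix view can still be read from Spark (or a Zeppelin Spark paragraph) through the generic JDBC data source. A minimal sketch; the view name, lower-case column names and ZooKeeper quorum are assumptions.

import org.apache.spark.sql.SparkSession

object PhoenixJdbcFromSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-jdbc-read").getOrCreate()
    import spark.implicits._

    // Plain JDBC data source; needs the Phoenix client jar on driver and executors.
    val prices = spark.read
      .format("jdbc")
      .option("url", "jdbc:phoenix:rhes564:2181")
      .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
      .option("dbtable", "\"marketDataHbase\"")
      .load()

    // A selective query: one ticker, newest first.
    prices.filter($"ticker" === "S01").orderBy($"timecreated".desc).show(5)
  }
}

Note that without partitioning options the JDBC read comes back as a single partition, so this suits selective queries rather than full scans.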
>>>> On 17 October 2016 at 21:54, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>> Mitch,
>>>>
>>>> Short answer… no, it doesn’t scale.
>>>>
>>>> Longer answer…
>>>>
>>>> You are using a UUID as the row key? Why? (My guess is that you want to
>>>> avoid hot spotting.)
>>>>
>>>> So you’re going to have to pull in all of the data… meaning a full table
>>>> scan… and then perform a sort order transformation, dropping the UUID in
>>>> the process.
>>>>
>>>> You would be better off not using HBase and storing the data in Parquet
>>>> files in a directory partitioned on date. Or rather, the rowkey would be
>>>> max_ts - TS so that your data is in LIFO order.
>>>> Note: I’ve used the term epoch to describe the max value of a long (8
>>>> bytes of ‘FF’) for the max_ts. This isn’t a good use of the term epoch,
>>>> but if anyone has a better term, please let me know.
>>>>
>>>> Having said that… if you want to use HBase, you could do the same thing.
>>>> If you want to avoid hot spotting, you could load the day’s transactions
>>>> using a bulk loader so that you don’t have to worry about splits.
>>>>
>>>> But that’s just my $0.02 worth.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> PS. If you wanted to capture the transactions… you could use the following
>>>> schema:
>>>>
>>>> 1) Rowkey = max_ts - TS
>>>> 2) Rows contain the following:
>>>>    CUSIP (Transaction ID)
>>>>    Party 1 (Seller)
>>>>    Party 2 (Buyer)
>>>>    Symbol
>>>>    Qty
>>>>    Price
>>>>
>>>> This is a trade ticket.
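The "max_ts - TS" rowkey Mike describes is easy to sketch; the object and method names below are illustrative, and only the HBase Bytes utility is assumed.

import java.time.Instant
import org.apache.hadoop.hbase.util.Bytes

object ReversedTimestampKey {
  // HBase sorts rows lexicographically by rowkey bytes, so storing
  // Long.MaxValue - eventTimeMillis makes the newest event sort first (LIFO).
  def rowKey(eventTimeMillis: Long): Array[Byte] =
    Bytes.toBytes(Long.MaxValue - eventTimeMillis)

  def main(args: Array[String]): Unit = {
    val older = rowKey(Instant.parse("2016-10-17T22:02:09Z").toEpochMilli)
    val newer = rowKey(Instant.parse("2016-10-17T22:17:09Z").toEpochMilli)
    // The newer event gets the smaller key, so it sorts ahead of the older one.
    println(Bytes.compareTo(newer, older) < 0) // prints: true
  }
}

Keys sorted newest-first do concentrate writes on a single region, which is why the same mail suggests loading each day's transactions with a bulk loader rather than writing them live.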
>>>>> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have trade data stored in an Hbase table. Data arrives in csv format to
>>>>> HDFS and is then loaded into Hbase via a periodic load with
>>>>> org.apache.hadoop.hbase.mapreduce.ImportTsv.
>>>>>
>>>>> The Hbase table has one Column Family "trade_info" and three columns:
>>>>> ticker, timecreated, price.
>>>>>
>>>>> The RowKey is a UUID. So each row has UUID, ticker, timecreated and price
>>>>> in the csv file.
>>>>>
>>>>> Each row in Hbase is a key, value map. In my case, I have one Column
>>>>> Family and three columns. Without going into semantics, I see Hbase as a
>>>>> column oriented database where column data stay together.
>>>>>
>>>>> So I thought of this way of accessing the data.
>>>>>
>>>>> I define an RDD for each column in the column family as below. In this
>>>>> case column trade_info:ticker
>>>>>
>>>>> //create rdd
>>>>> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>>>>>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>>>>>   classOf[org.apache.hadoop.hbase.client.Result])
>>>>> val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow,
>>>>>   result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
>>>>>   (
>>>>>     row._1.map(_.toChar).mkString,
>>>>>     row._2.asScala.reduceLeft {
>>>>>       (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
>>>>>     }.getValue.map(_.toChar).mkString
>>>>>   )
>>>>> })
>>>>> case class columns (key: String, ticker: String)
>>>>> val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString))
>>>>>
>>>>> Note that the end result is a DataFrame with the RowKey -> key and column
>>>>> -> ticker.
>>>>>
>>>>> I use the same approach to create two other DataFrames, namely
>>>>> dftimecreated and dfprice, for the two other columns.
>>>>>
>>>>> Note that if I don't need a column, then I do not create a DF for it, so
>>>>> there is a DF for each column I use. I am not sure how this compares with
>>>>> reading the full row through other methods, if any.
>>>>>
>>>>> Anyway, all I need to do after creating a DataFrame for each column is to
>>>>> join them through the RowKey to slice and dice the data. Like below.
>>>>>
>>>>> Get me the latest prices ordered by timecreated and ticker (ticker is the stock):
>>>>>
>>>>> val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key")
>>>>>   .orderBy('timecreated desc, 'price desc)
>>>>>   .select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
>>>>> rs.show(10)
>>>>>
>>>>> +-------------------+------+------------+
>>>>> |        timecreated|ticker|Latest price|
>>>>> +-------------------+------+------------+
>>>>> |2016-10-16T18:44:57|   S16|   97.631966|
>>>>> |2016-10-16T18:44:57|   S13|    92.11406|
>>>>> |2016-10-16T18:44:57|   S19|    85.93021|
>>>>> |2016-10-16T18:44:57|   S09|   85.714645|
>>>>> |2016-10-16T18:44:57|   S15|    82.38932|
>>>>> |2016-10-16T18:44:57|   S17|    80.77747|
>>>>> |2016-10-16T18:44:57|   S06|    79.81854|
>>>>> |2016-10-16T18:44:57|   S18|    74.10128|
>>>>> |2016-10-16T18:44:57|   S07|    66.13622|
>>>>> |2016-10-16T18:44:57|   S20|    60.35727|
>>>>> +-------------------+------+------------+
>>>>> only showing top 10 rows
>>>>>
>>>>> Is this a workable solution?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>
>> --
>> Best Regards,
>> Ayan Guha
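Coming back to the code in the original question: it builds one DataFrame per column and then joins them back on the rowkey. A possible variation, sketched below under the same table layout (table marketDataHbase, column family price_info), reads all three qualifiers from each Result in a single pass, so no per-column DataFrames and no joins are needed. Everything apart from the table and column names is illustrative, and HBase connection settings are expected to come from hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object HBaseSinglePassRead {
  case class Quote(key: String, ticker: String, timecreated: String, price: String)

  private val cf = Bytes.toBytes("price_info")

  // Latest cell version of one qualifier, decoded as a String (null if the cell is absent).
  private def col(r: Result, qualifier: String): String =
    Bytes.toString(r.getValue(cf, Bytes.toBytes(qualifier)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-single-pass").getOrCreate()
    import spark.implicits._

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "marketDataHbase")

    val quotes = spark.sparkContext
      .newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
      .map { case (rowKey, result) =>
        Quote(Bytes.toString(rowKey.copyBytes()),
          col(result, "ticker"),
          col(result, "timecreated"),
          col(result, "price"))
      }
      .toDF()

    // The same "latest prices" view as in the mail.
    quotes.orderBy($"timecreated".desc, $"price".desc)
      .select($"timecreated", $"ticker", $"price".cast("float").as("Latest price"))
      .show(10)
  }
}

Whether this beats the per-column joins depends on how many columns are actually needed; reading only the qualifiers you use, as the original code does, is still the right instinct.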