Careful: HBase with Phoenix is faster only in certain scenarios, namely when you process a small slice out of a much larger data set (the exact threshold depends on node memory, the operation, etc.). Hive+Tez+ORC can be quite competitive, and LLAP makes sense for interactive ad-hoc queries that are broadly similar to each other. Phoenix and Hive serve different purposes, with different architectures and underlying data structures.
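To make the contrast concrete, a rough sketch of the two query shapes follows. The table, column family and host names are taken from the thread below; the Phoenix view definition, its pk column and the "test" Hive database are assumptions made for illustration.

// Sketch only: a Phoenix point lookup vs a Hive/ORC scan-and-aggregate.
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object TwoAccessPatterns {
  def main(args: Array[String]): Unit = {
    // 1) Selective lookup of one row out of a large table: the Phoenix-on-HBase sweet spot.
    //    Needs the Phoenix client jar on the classpath; view and pk names are assumptions.
    val conn = DriverManager.getConnection("jdbc:phoenix:rhes564:2181")
    val ps = conn.prepareStatement(
      """SELECT "price_info"."ticker", "price_info"."price" FROM "marketDataHbase" WHERE pk = ?""")
    ps.setString(1, "a2c844ed-137f-4820-aa6e-c49739e46fa6")
    val rs = ps.executeQuery()
    while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getString(2)}")
    conn.close()

    // 2) Scan-and-aggregate over a whole day's partition: better served by Hive on ORC,
    //    and by LLAP once similar ad-hoc queries are run interactively and repeatedly.
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sql(
      """SELECT security, avg(price) AS avg_price
        |FROM test.marketData
        |WHERE datestamp = '2016-10-18'
        |GROUP BY security""".stripMargin).show()
  }
}

The first query touches one row via the key; the second has to read a whole day's partition, which is where ORC plus Tez (and LLAP caching for repeated queries) pays off.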
> On 18 Oct 2016, at 07:44, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> yes, the Hive external table is partitioned on a daily basis (datestamp below)
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData (
>      KEY string
>    , SECURITY string
>    , TIMECREATED string
>    , PRICE float
> )
> COMMENT 'From prices Kafka delivered by Flume location by day'
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION 'hdfs://rhes564:9000/data/prices/'
> --TBLPROPERTIES ("skip.header.line.count"="1")
> ;
>
> ALTER TABLE ${DATABASE}.externalMarketData SET LOCATION
> 'hdfs://rhes564:9000/data/prices/${TODAY}';
>
> and there is an insert/overwrite into the managed table every 15 minutes:
>
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
> SELECT
>      KEY
>    , SECURITY
>    , TIMECREATED
>    , PRICE
>    , 1
>    , CAST(from_unixtime(unix_timestamp()) AS timestamp)
> FROM ${DATABASE}.externalMarketData
>
> That works fine. However, Hbase is much faster for data retrieval with Phoenix.
>
> When we get Hive with LLAP, I gather Hive will replace Hbase.
>
> So in summary we have:
>
> raw data delivered to HDFS
> data ingested into Hbase via cron
> the HDFS directory mapped to a Hive external table
> a Hive managed table with added optimisation/indexing (ORC)
>
> There are a number of ways of doing it, as usual.
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>> On 18 October 2016 at 00:48, ayan guha <guha.a...@gmail.com> wrote:
>> I do not see a rationale to have hbase in this scheme of things... maybe I
>> am missing something?
>>
>> If data is delivered in HDFS, why not just add a partition to an existing
>> Hive table?
>>
>>> On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Thanks Mike,
>>>
>>> My test csv data comes as:
>>>
>>> UUID, ticker, timecreated, price
>>> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09, 53.36665625650533484995
>>> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09, 86.31917515824627016510
>>> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09, 95.48298277703729129559
>>>
>>> And this is my Hbase table with one column family:
>>>
>>> create 'marketDataHbase', 'price_info'
>>>
>>> It is populated every 15 minutes from test.csv files delivered via Kafka
>>> and Flume to HDFS:
>>>
>>> 1) create a fat csv file based on all the small csv files for today --> prices/2016-10-17
>>> 2) populate the data into the Hbase table using
>>>    org.apache.hadoop.hbase.mapreduce.ImportTsv, which is pretty quick using MapReduce
>>>
>>> That ImportTsv only appends new rows to the Hbase table, as the choice of
>>> UUID as rowKey avoids any issues.
>>>
>>> So I only have a 15 minute lag in my batch Hbase table.
>>>
>>> I have both Hive ORC tables and Phoenix views on top of this Hbase table.
>>>
>>> 1) Phoenix offers the fastest response if used on top of Hbase. Unfortunately,
>>>    Spark 2 with Phoenix is broken.
>>> 2) Spark on Hive on Hbase looks OK. This works fine with Spark 2.
>>> 3) Spark on Hbase tables directly, using key, value DFs for each column. Not as
>>>    fast as 2) but works. I don't think a DF is a good choice for a key, value pair?
>>>
>>> Now if I use Zeppelin to read from Hbase:
>>>
>>> I can use Phoenix JDBC. That looks very fast.
>>> I can use Spark csv directly on the HDFS csv files.
>>> I can use Spark on the Hive tables.
>>>
>>> If I use Tableau on the Hbase data, then only SQL-like access is useful:
>>> Phoenix or Hive.
>>>
>>> I don't want to change the design now. But admittedly Hive is the best SQL
>>> on top of Hbase. The next release of Hive is going to have an in-memory
>>> caching layer (LLAP) so we can cache Hive tables in memory. That will be
>>> faster. Many people underestimate Hive, but I still believe it has a lot to
>>> offer besides serious ANSI-compliant SQL.
>>>
>>> Regards
>>>
>>> Mich
>>>
>>> Dr Mich Talebzadeh
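On the Zeppelin/Phoenix point above: while the phoenix-spark integration was reported broken with Spark 2, the Phoenix view can still be read from Spark (or a Zeppelin Spark paragraph) through the generic JDBC data source. A minimal sketch; the view name, lower-case column names and ZooKeeper quorum are assumptions.

import org.apache.spark.sql.SparkSession

object PhoenixJdbcFromSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-jdbc-read").getOrCreate()
    import spark.implicits._

    // Plain JDBC data source; needs the Phoenix client jar on driver and executors.
    val prices = spark.read
      .format("jdbc")
      .option("url", "jdbc:phoenix:rhes564:2181")
      .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
      .option("dbtable", "\"marketDataHbase\"")
      .load()

    // A selective query: one ticker, newest first.
    prices.filter($"ticker" === "S01").orderBy($"timecreated".desc).show(5)
  }
}

Note that without partitioning options the JDBC read comes back as a single partition, so this suits selective queries rather than full scans.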
>>>> On 17 October 2016 at 21:54, Michael Segel <msegel_had...@hotmail.com> wrote:
>>>> Mitch,
>>>>
>>>> Short answer… no, it doesn’t scale.
>>>>
>>>> Longer answer…
>>>>
>>>> You are using a UUID as the row key? Why? (My guess is that you want to
>>>> avoid hot spotting.)
>>>>
>>>> So you’re going to have to pull in all of the data… meaning a full table
>>>> scan… and then perform a sort order transformation, dropping the UUID in
>>>> the process.
>>>>
>>>> You would be better off not using HBase and storing the data in Parquet
>>>> files in a directory partitioned on date. Or rather, the rowkey would be
>>>> max_ts - TS so that your data is in LIFO order.
>>>> Note: I’ve used the term epoch to describe the max value of a long (8
>>>> bytes of ‘FF’) for the max_ts. This isn’t a good use of the term epoch,
>>>> but if anyone has a better term, please let me know.
>>>>
>>>> Having said that… if you want to use HBase, you could do the same thing.
>>>> If you want to avoid hot spotting, you could load the day’s transactions
>>>> using a bulk loader so that you don’t have to worry about splits.
>>>>
>>>> But that’s just my $0.02 worth.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> PS. If you wanted to capture the transactions… you could use the following
>>>> schema:
>>>>
>>>> 1) Rowkey = max_ts - TS
>>>> 2) Rows contain the following:
>>>>    CUSIP (Transaction ID)
>>>>    Party 1 (Seller)
>>>>    Party 2 (Buyer)
>>>>    Symbol
>>>>    Qty
>>>>    Price
>>>>
>>>> This is a trade ticket.
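The "max_ts - TS" rowkey Mike describes is easy to sketch; the object and method names below are illustrative, and only the HBase Bytes utility is assumed.

import java.time.Instant
import org.apache.hadoop.hbase.util.Bytes

object ReversedTimestampKey {
  // HBase sorts rows lexicographically by rowkey bytes, so storing
  // Long.MaxValue - eventTimeMillis makes the newest event sort first (LIFO).
  def rowKey(eventTimeMillis: Long): Array[Byte] =
    Bytes.toBytes(Long.MaxValue - eventTimeMillis)

  def main(args: Array[String]): Unit = {
    val older = rowKey(Instant.parse("2016-10-17T22:02:09Z").toEpochMilli)
    val newer = rowKey(Instant.parse("2016-10-17T22:17:09Z").toEpochMilli)
    // The newer event gets the smaller key, so it sorts ahead of the older one.
    println(Bytes.compareTo(newer, older) < 0) // prints: true
  }
}

Keys sorted newest-first do concentrate writes on a single region, which is why the same mail suggests loading each day's transactions with a bulk loader rather than writing them live.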
>>>>> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have trade data stored in an Hbase table. Data arrives in csv format to
>>>>> HDFS and is then loaded into Hbase via a periodic load with
>>>>> org.apache.hadoop.hbase.mapreduce.ImportTsv.
>>>>>
>>>>> The Hbase table has one Column Family "trade_info" and three columns:
>>>>> ticker, timecreated, price.
>>>>>
>>>>> The RowKey is a UUID. So each row has UUID, ticker, timecreated and price
>>>>> in the csv file.
>>>>>
>>>>> Each row in Hbase is a key, value map. In my case, I have one Column
>>>>> Family and three columns. Without going into semantics, I see Hbase as a
>>>>> column oriented database where column data stay together.
>>>>>
>>>>> So I thought of this way of accessing the data.
>>>>>
>>>>> I define an RDD for each column in the column family as below. In this
>>>>> case column trade_info:ticker
>>>>>
>>>>> //create rdd
>>>>> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>>>>>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>>>>>   classOf[org.apache.hadoop.hbase.client.Result])
>>>>> val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow,
>>>>>   result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
>>>>>   (
>>>>>     row._1.map(_.toChar).mkString,
>>>>>     row._2.asScala.reduceLeft {
>>>>>       (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
>>>>>     }.getValue.map(_.toChar).mkString
>>>>>   )
>>>>> })
>>>>> case class columns (key: String, ticker: String)
>>>>> val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString))
>>>>>
>>>>> Note that the end result is a DataFrame with the RowKey -> key and column
>>>>> -> ticker.
>>>>>
>>>>> I use the same approach to create two other DataFrames, namely
>>>>> dftimecreated and dfprice, for the two other columns.
>>>>>
>>>>> Note that if I don't need a column, then I do not create a DF for it, so
>>>>> there is a DF for each column I use. I am not sure how this compares with
>>>>> reading the full row through other methods, if any.
>>>>>
>>>>> Anyway, all I need to do after creating a DataFrame for each column is to
>>>>> join them through the RowKey to slice and dice the data. Like below.
>>>>>
>>>>> Get me the latest prices ordered by timecreated and ticker (ticker is the stock):
>>>>>
>>>>> val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key")
>>>>>   .orderBy('timecreated desc, 'price desc)
>>>>>   .select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
>>>>> rs.show(10)
>>>>>
>>>>> +-------------------+------+------------+
>>>>> |        timecreated|ticker|Latest price|
>>>>> +-------------------+------+------------+
>>>>> |2016-10-16T18:44:57|   S16|   97.631966|
>>>>> |2016-10-16T18:44:57|   S13|    92.11406|
>>>>> |2016-10-16T18:44:57|   S19|    85.93021|
>>>>> |2016-10-16T18:44:57|   S09|   85.714645|
>>>>> |2016-10-16T18:44:57|   S15|    82.38932|
>>>>> |2016-10-16T18:44:57|   S17|    80.77747|
>>>>> |2016-10-16T18:44:57|   S06|    79.81854|
>>>>> |2016-10-16T18:44:57|   S18|    74.10128|
>>>>> |2016-10-16T18:44:57|   S07|    66.13622|
>>>>> |2016-10-16T18:44:57|   S20|    60.35727|
>>>>> +-------------------+------+------------+
>>>>> only showing top 10 rows
>>>>>
>>>>> Is this a workable solution?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>
>> --
>> Best Regards,
>> Ayan Guha
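Coming back to the code in the original question: it builds one DataFrame per column and then joins them back on the rowkey. A possible variation, sketched below under the same table layout (table marketDataHbase, column family price_info), reads all three qualifiers from each Result in a single pass, so no per-column DataFrames and no joins are needed. Everything apart from the table and column names is illustrative, and HBase connection settings are expected to come from hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object HBaseSinglePassRead {
  case class Quote(key: String, ticker: String, timecreated: String, price: String)

  private val cf = Bytes.toBytes("price_info")

  // Latest cell version of one qualifier, decoded as a String (null if the cell is absent).
  private def col(r: Result, qualifier: String): String =
    Bytes.toString(r.getValue(cf, Bytes.toBytes(qualifier)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-single-pass").getOrCreate()
    import spark.implicits._

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "marketDataHbase")

    val quotes = spark.sparkContext
      .newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
      .map { case (rowKey, result) =>
        Quote(Bytes.toString(rowKey.copyBytes()),
          col(result, "ticker"),
          col(result, "timecreated"),
          col(result, "price"))
      }
      .toDF()

    // The same "latest prices" view as in the mail.
    quotes.orderBy($"timecreated".desc, $"price".desc)
      .select($"timecreated", $"ticker", $"price".cast("float").as("Latest price"))
      .show(10)
  }
}

Whether this beats the per-column joins depends on how many columns are actually needed; reading only the qualifiers you use, as the original code does, is still the right instinct.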