It has some implications, because it imposes the SQL model on HBase. Internally it translates SQL queries into scans and custom HBase coprocessor calls. Keep in mind also why HBase needs a proper key design, and how Phoenix designs those keys, to get the best performance out of it. I think for OLTP it is a workable model, and I think they plan to offer Phoenix as a default interface as part of HBase anyway. For OLAP it depends.
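To make the key-design point concrete, here is a minimal Phoenix DDL sketch (table and column names are made up): the composite primary key is concatenated into the HBase row key, and SALT_BUCKETS pre-splits that key space so sequential writes do not hot-spot a single region server.

    -- Hypothetical table: the PRIMARY KEY constraint below becomes the
    -- HBase row key (host + usage_date concatenated, in that order).
    CREATE TABLE web_stat (
        host            VARCHAR NOT NULL,
        usage_date      DATE    NOT NULL,
        active_visitors BIGINT
        CONSTRAINT pk PRIMARY KEY (host, usage_date)
    ) SALT_BUCKETS = 16;  -- spread writes across 16 pre-split regions

Queries that lead with host can then be served by row-key range scans rather than full table scans, which is where Phoenix gets its performance.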
> On 17 Oct 2016, at 22:34, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi
>
> Any reason not to recommend Phoenix? I haven't used it myself, so I am
> curious about the pros and cons of using it.
>
>> On 18 Oct 2016 03:17, "Michael Segel" <msegel_had...@hotmail.com> wrote:
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…):
>>
>> You can use HiveServer2 as a connection point to HBase.
>> While this doesn't perform well, it's probably the cleanest solution.
>> I'm not keen on Phoenix… wouldn't recommend it….
>>
>> The issue is that you're trying to make HBase, a key/value object store,
>> a relational engine… it's not.
>>
>> There are some considerations which make HBase not ideal for all use
>> cases, and you may find better performance with Parquet files.
>>
>> One thing missing is the secondary indexing and query optimization that
>> you have in RDBMSs and that are lacking in HBase / MapRDB / etc.… so your
>> performance will vary.
>>
>> With respect to Tableau… their entire interface into the big data world
>> revolves around the JDBC/ODBC interface. So if you don't have that piece
>> as part of your solution, you're DOA with respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point? (YAAP: Yet
>> Another Apache Project)
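For reference, the HiveServer2 route mentioned above is Hive's HBase storage handler. A minimal sketch against the 'tsco' table that appears further down this thread (column list abbreviated, Hive table name assumed):

    CREATE EXTERNAL TABLE tsco_hive (rowkey STRING, close STRING, volume STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      -- ':key' maps the HBase row key; the rest are family:qualifier pairs
      'hbase.columns.mapping' = ':key,stock_daily:close,stock_daily:volume'
    )
    TBLPROPERTIES ('hbase.table.name' = 'tsco');

Every query against such a table turns into HBase scans underneath, which is why, as noted later in the thread, response times through HiveServer2 tend to be long.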
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Thanks for all the suggestions. It would seem you guys are right about
>>> the Tableau side of things. The reports don't need to be real-time, and
>>> they won't be directly feeding off of the main DMP HBase data. Instead,
>>> it'll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>>
>>> I originally thought that we needed two-way data retrieval from the DMP
>>> HBase for ID generation, but after further investigation into the
>>> use-case and architecture, the ID generation needs to happen locally to
>>> the ad servers, where we generate a unique ID and store it in an ID
>>> linking table. Even better, many of the 3rd-party services supply this
>>> ID. So, data only needs to flow in one direction. We will use Kafka as
>>> the bus for this. No JDBC required. This also goes for the REST
>>> endpoints. 3rd-party services will hit ours to update our data, with no
>>> need to read from our data. And when we want to update their data, we
>>> will hit theirs using a triggered job.
>>>
>>> This all boils down to just integrating with Kafka.
>>>
>>> Once again, thanks for all the help.
>>>
>>> Cheers,
>>> Ben
>>>
>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Please also keep in mind that Tableau Server has the capability to
>>>> store data in-memory and to refresh the in-memory data only when
>>>> needed. This means you can import it from any source and let your
>>>> users work only on the in-memory data in Tableau Server.
>>>>
>>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>>>> already provided a good alternative. However, you should check whether
>>>>> it contains a recent version of HBase and Phoenix. That being said, I
>>>>> just wonder what the dataflow, the data model and the analysis you
>>>>> plan to do look like. Maybe completely different solutions are
>>>>> possible. In particular, these single inserts, upserts etc. should be
>>>>> avoided as much as possible in the Big Data (analysis) world with any
>>>>> technology, because they do not perform well.
>>>>>
>>>>> Hive with LLAP will provide an in-memory cache for interactive
>>>>> analytics. You can also put full tables in-memory with Hive using the
>>>>> Ignite HDFS in-memory solution. All of this only makes sense if you do
>>>>> not use MR as the engine, and if you use the right input format (ORC,
>>>>> Parquet) and a recent Hive version.
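"The right input format" is a one-line change in HiveQL; a minimal sketch with assumed table and column names:

    -- Columnar ORC storage is what LLAP's in-memory cache is built around.
    CREATE TABLE events_orc (id BIGINT, ts TIMESTAMP, payload STRING)
    STORED AS ORC;

    -- Or convert an existing table in one statement:
    CREATE TABLE events_converted STORED AS ORC AS SELECT * FROM events;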
>>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>
>>>>>> Mich,
>>>>>>
>>>>>> Unfortunately, we are moving away from Hive and unifying on Spark,
>>>>>> using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC
>>>>>> driver too. I will either try the Phoenix JDBC Server for HBase or
>>>>>> push to move faster to Kudu with Impala. We will use Impala as the
>>>>>> JDBC in-between until the Kudu team completes Spark SQL support for
>>>>>> JDBC.
>>>>>>
>>>>>> Thanks for the advice.
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh
>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Sure. But essentially you are looking at batch data for analytics
>>>>>>> for your Tableau users, so Hive may be a better choice with its rich
>>>>>>> SQL and its existing ODBC/JDBC connection to Tableau.
>>>>>>>
>>>>>>> I would go for Hive, especially as the new release will have an
>>>>>>> in-memory offering as well for frequently accessed data :)
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> LinkedIn
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>>>>>> any loss, damage or destruction of data or any other property which
>>>>>>> may arise from relying on this email's technical content is
>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>> Mich,
>>>>>>>>
>>>>>>>> First and foremost, we have visualization servers that run Tableau
>>>>>>>> for external user reports. Second, we have servers that are ad
>>>>>>>> servers and REST endpoints for cookie sync and segmentation data
>>>>>>>> exchange. These will use JDBC directly within the same data-center.
>>>>>>>> When not colocated in the same data-center, they will connect to a
>>>>>>>> database server using JDBC. Either way, using JDBC everywhere
>>>>>>>> simplifies and unifies the code around the JDBC industry standard.
>>>>>>>>
>>>>>>>> Does this make sense?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh
>>>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Like any other design: what are your presentation layer and end
>>>>>>>>> users?
>>>>>>>>>
>>>>>>>>> Are they SQL-centric users from a Tableau background, or might
>>>>>>>>> they use Spark functional programming?
>>>>>>>>>
>>>>>>>>> It is best to describe the use case.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>> On 8 October 2016 at 19:40, Felix Cheung
>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>> I wouldn't be too surprised if Spark SQL - JDBC data source -
>>>>>>>>>> Phoenix JDBC server - HBASE would work better.
>>>>>>>>>>
>>>>>>>>>> Without naming specifics, there are at least 4 or 5 different
>>>>>>>>>> implementations of HBASE sources, each at a varying level of
>>>>>>>>>> development and with different requirements (HBASE release
>>>>>>>>>> version, Kerberos support etc.)
>>>>>>>>>>
>>>>>>>>>> _____________________________
>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>
>>>>>>>>>> Mich,
>>>>>>>>>>
>>>>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot
>>>>>>>>>> about that alternative.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh
>>>>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I don't think it will work.
>>>>>>>>>>
>>>>>>>>>> You can use Phoenix on top of HBase:
>>>>>>>>>>
>>>>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>>>>> ROW            COLUMN+CELL
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>>>>
>>>>>>>>>> And the same via Phoenix on top of the HBase table:
>>>>>>>>>>
>>>>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>>>>>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>>>>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS
>>>>>>>>>> "Day's Low", "open" AS "Day's Open", "ticker", "volume",
>>>>>>>>>> (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from
>>>>>>>>>> "tsco" where to_number("volume") > 0 and "high" != '-' and
>>>>>>>>>> to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>>>>>>> order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Dr Mich Talebzadeh
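As an untested sketch of combining this with the Spark SQL Thriftserver idea: register the Phoenix table through Spark's generic JDBC data source, pointing at the Phoenix Query Server's thin driver. Whether the generic JDBC source cooperates with the thin driver (and with Phoenix's case-sensitive quoted table names) is exactly what would need verifying first.

    CREATE TABLE tsco_via_phoenix
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      -- thin-client URL as shown in the beeline session above
      url "jdbc:phoenix:thin:url=http://rhes564:8765",
      driver "org.apache.phoenix.queryserver.client.Driver",
      dbtable "tsco"  -- Phoenix upper-cases unquoted names; may need escaping
    );

Once registered, the table would be queryable over the Thriftserver's JDBC/ODBC endpoint like any other Spark SQL table.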
>>>>>>>>>>> On 8 October 2016 at 19:05, Felix Cheung
>>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>> Great, then I think those packages as a Spark data source should
>>>>>>>>>>> allow you to do exactly that (replace org.apache.spark.sql.jdbc
>>>>>>>>>>> with an HBASE one).
>>>>>>>>>>>
>>>>>>>>>>> I do think it would be great to get more examples around this,
>>>>>>>>>>> though. It would be great if you could share your experience
>>>>>>>>>>> with this!
>>>>>>>>>>>
>>>>>>>>>>> _____________________________
>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>>>>>
>>>>>>>>>>> Felix,
>>>>>>>>>>>
>>>>>>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase
>>>>>>>>>>> tables using just SQL. I have been able to CREATE tables using
>>>>>>>>>>> this statement below in the past:
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE <table-name>
>>>>>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>>>>>> OPTIONS (
>>>>>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>>>>>> );
>>>>>>>>>>>
>>>>>>>>>>> After doing this, I can access the PostgreSQL table through the
>>>>>>>>>>> Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE,
>>>>>>>>>>> INSERT, etc.). I want to do the same with HBase tables. We tried
>>>>>>>>>>> this using Hive and HiveServer2, but the response times are just
>>>>>>>>>>> too long.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung
>>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Ben,
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure I'm following completely.
>>>>>>>>>>>
>>>>>>>>>>> Is your goal to use Spark to create or access tables in HBASE? If
>>>>>>>>>>> so, the link below and several packages out there support that by
>>>>>>>>>>> providing an HBASE data source for Spark. There are some examples
>>>>>>>>>>> of what the Spark code looks like in that link as well. On that
>>>>>>>>>>> note, you should also be able to use the HBASE data source from a
>>>>>>>>>>> pure SQL (Spark SQL) query, which should work in the case of the
>>>>>>>>>>> Spark SQL JDBC Thrift Server (with USING,
>>>>>>>>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
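Spelled out, that substitution would look something like the sketch below. The data source class and option names follow the hbase-spark module's SQL mapping, but they differ between connectors (SHC, for example, uses a JSON catalog instead), so treat every name here as illustrative:

    CREATE TABLE hbase_tsco
    USING org.apache.hadoop.hbase.spark
    OPTIONS (
      hbase.table "tsco",
      -- "<spark column> <type> <family>:<qualifier>"; ':key' maps the row key
      hbase.columns.mapping "KEY_FIELD STRING :key, close STRING stock_daily:close, volume STRING stock_daily:volume"
    );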
>>>>>>>>>>> _____________________________
>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>>>>>
>>>>>>>>>>> Felix,
>>>>>>>>>>>
>>>>>>>>>>> The only alternative way is to create a stored procedure (UDF),
>>>>>>>>>>> in database terms, that would run Spark Scala code underneath. In
>>>>>>>>>>> this way, I can use the Spark SQL JDBC Thriftserver to execute it
>>>>>>>>>>> with SQL code, passing the key/values I want to UPSERT. I wonder
>>>>>>>>>>> if this is possible, since I cannot CREATE a wrapper table on top
>>>>>>>>>>> of an HBase table in Spark SQL?
>>>>>>>>>>>
>>>>>>>>>>> What do you think? Is this the right approach?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung
>>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> HBase has released support for Spark:
>>>>>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>>>>>
>>>>>>>>>>> And if you search, you should find several alternative approaches.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim"
>>>>>>>>>>> <bbuil...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Does anyone know if Spark can work with HBase tables using Spark
>>>>>>>>>>> SQL? I know that in Hive we are able to create tables on top of
>>>>>>>>>>> an underlying HBase table that can be accessed using MapReduce
>>>>>>>>>>> jobs. Can the same be done using HiveContext or SQLContext? We
>>>>>>>>>>> are trying to set up a way to GET and POST data to and from the
>>>>>>>>>>> HBase table using the Spark SQL JDBC thriftserver from our
>>>>>>>>>>> RESTful API endpoints and/or HTTP web farms. If we can get this
>>>>>>>>>>> to work, then we can load-balance the thriftservers. In addition,
>>>>>>>>>>> this will benefit us by giving us a way to abstract the data
>>>>>>>>>>> storage layer away from the presentation layer code. There is a
>>>>>>>>>>> chance that we will swap out the data storage technology in the
>>>>>>>>>>> future. We are currently experimenting with Kudu.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org