@Mitch

You don’t have a schema in HBase other than the table name and the list of 
associated column families.

So you can’t really infer a schema easily…
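For what it's worth, that list of column families is about all the client API will hand you. A minimal sketch (assuming an HBase 1.x client on the classpath; 'tsco' is just the table name used elsewhere in this thread):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin

// The only "schema" HBase exposes: the table descriptor and its column families.
// Column qualifiers and cell types are only discoverable by reading the data itself.
val descriptor = admin.getTableDescriptor(TableName.valueOf("tsco"))
descriptor.getColumnFamilies.foreach(cf => println(cf.getNameAsString))

connection.close()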


On Oct 17, 2016, at 2:17 PM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:

How about this method of creating DataFrames on HBase tables directly?

I define an RDD for each column in the column family as below; in this case the 
column price_info:ticker.

// create an RDD over the HBase table (conf must already carry the connection
// details and TableInputFormat.INPUT_TABLE)
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import scala.collection.JavaConverters._

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// keep only the Result, pull out the row key and the ticker column of the price_info
// family, and for each row keep the cell version with the latest timestamp
// (getColumn returns all versions; on newer HBase clients this is getColumnCells)
val rdd1 = hBaseRDD.map(_._2).map(result =>
  (result.getRow, result.getColumn("price_info".getBytes(), "ticker".getBytes()))
).map(row => (
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
    (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toChar).mkString
))

case class columns(key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString))

Note that the end result is a DataFrame with the RowKey as key and the column value as 
ticker.

I use the same approach to create two other DataFrames, namely dftimecreated 
and dfprice for the two other columns.

Note that if I don't need a column, I do not create a DataFrame for it, so there is one 
DataFrame per column I use. I am not sure how this compares with reading the full row 
through other methods, if there are any.

Anyway, all I need to do after creating a DataFrame for each column is to join 
them through the RowKey to slice and dice data, like below.

Get me the latest prices ordered by timecreated and price (ticker is the stock symbol)

val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key").
  orderBy('timecreated desc, 'price desc).
  select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 17 October 2016 at 19:53, vincent gromakowski 
<vincent.gromakow...@gmail.com> wrote:
Instead of (or in addition to) saving results somewhere, you just start a 
thriftserver that exposes the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today, using the thriftserver means reading data 
from the persistent store on every query, so if the data modeling doesn't fit the 
query it can be quite slow. What you generally do in a common Spark job is to 
load the data and cache the Spark table in an in-memory columnar table, which is 
quite efficient for any kind of query; the counterpart is that the cache isn't 
updated, so you have to implement a reload mechanism, and this solution isn't 
available using the thriftserver.
What I propose is to mix the two worlds: periodically/delta-load data into the Spark 
table cache and expose it through the thriftserver. But you have to implement 
the loading logic; it can range from very simple to very complex depending on your 
needs.
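For reference, a minimal sketch of that pattern (assuming a Spark 1.6-style HiveContext; the table name and source path are placeholders): run an ordinary Spark job or shell, pin a table in the in-memory columnar cache, then start the JDBC/ODBC endpoint in-process.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)

// Load from the persistent store (HBase, Parquet, JDBC, ...) with whatever data source
// you use, register it as a temporary table and pin it in the in-memory columnar cache.
val df = hiveContext.read.parquet("/data/prices")   // placeholder source
df.registerTempTable("prices")
hiveContext.cacheTable("prices")

// Expose this context's tables over JDBC/ODBC (port 10000 by default).
HiveThriftServer2.startWithContext(hiveContext)

// The reload logic is up to you, e.g. a background thread that periodically re-reads
// the source, then uncaches and re-caches the table.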


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
<bbuil...@gmail.com>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
<vincent.gromakow...@gmail.com> wrote:

I would suggest coding your own Spark thriftserver, which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic, 
because it's a Spark job, and then start a thrift server on temporary tables. For 
example, you can query a micro-batch RDD from a Kafka stream, or pre-load some 
tables and implement a rolling cache to periodically update the Spark in-memory 
tables from the persistent store...
It's not part of the public API and I don't know yet what the issues of doing 
this are, but I think the Spark community should look at this path: making the 
thriftserver instantiable in any Spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
<msegel_had...@hotmail.com>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, into a 
relational engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is secondary indexing and the query optimizations that you have in 
RDBMSs but that are lacking in HBase / MapRDB / etc., so your performance will vary.

With respect to Tableau… their entire interface into the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers, where 
we generate a unique ID and store it in an ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This also 
goes for the REST endpoints: 3rd party services will hit ours to update our 
data, with no need to read from our data. And when we want to update their 
data, we will hit theirs using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
<jornfra...@gmail.com> wrote:

Please also keep in mind that Tableau Server has the capability to store data 
in-memory and to refresh the in-memory data only when needed. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
<jornfra...@gmail.com> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided 
a good alternative. However, you should check if it contains a recent 
version of HBase and Phoenix. That being said, I just wonder what the 
dataflow, data model and the analysis you plan to do are. Maybe there are 
completely different solutions possible. In particular, single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You 
can also put full tables in-memory with Hive using the Ignite HDFS in-memory solution. 
All of this only makes sense if you do not use MR as the engine, and if you use the 
right input format (ORC, Parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
<bbuil...@gmail.com> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 
as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will 
either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with 
Impala. We will use Impala as the JDBC in-between until the Kudu team completes 
Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:

Sure. But essentially you are looking at batch data for analytics for your 
Tableau users, so Hive may be a better choice with its rich SQL and an ODBC/JDBC 
connection to Tableau already.

I would go for Hive, especially as the new release will have an in-memory offering 
as well for frequently accessed data :)


Dr Mich Talebzadeh



On 8 October 2016 at 20:15, Benjamin Kim 
<bbuil...@gmail.com> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external 
user reports. Second, we have servers that are ad servers and REST endpoints 
for cookie sync and segmentation data exchange. These will use JDBC directly 
within the same data-center. When not colocated in the same data-center, they 
will connect to a local database server using JDBC. Either way, using JDBC 
everywhere simplifies and unifies the code around the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:

Like any other design, what is your presentation layer and who are your end users?

Are they SQL-centric users from a Tableau background, or might they use Spark 
functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



On 8 October 2016 at 19:40, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server 
-> HBase would work better.

Without naming specifics, there are at least 4 or 5 different implementations 
of HBase sources, each at a varying level of development and with different 
requirements (HBase release version, Kerberos support, etc.)
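For illustration, the Spark-SQL-over-Phoenix leg of that chain might look roughly like this (assuming the Phoenix Query Server thin client jar is on the classpath; the host, port and table come from Mich's example further down, and the exact quoting of Phoenix identifiers may need tweaking):

val phoenixDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF")
  .option("driver", "org.apache.phoenix.queryserver.client.Driver")
  .option("dbtable", "\"tsco\"")
  .load()

// Registered as a temp table (and optionally cached), it is then visible to the
// Spark SQL Thrift Server as well.
phoenixDF.registerTempTable("tsco")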


_____________________________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that 
alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:

I don't think it will work.

You can use Phoenix on top of HBase.

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW            COLUMN+CELL
 TSCO-1-Apr-08 column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08 column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08 column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08 column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08 column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08 column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08 column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08 column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the HBase table:

0: jdbc:phoenix:thin:url=http://rhes564:8765> select 
substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS 
"Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's 
Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS 
"AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' 
and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by 
to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+


HTH




Dr Mich Talebzadeh



On 8 October 2016 at 19:05, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
Great, then I think those packages, as Spark data sources, should allow you to do 
exactly that (replace org.apache.spark.sql.jdbc with an HBase one).

I do think it will be great to get more examples around this though. Would be 
great if you could share your experience with this!


_____________________________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <user@spark.apache.org>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just 
SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 
"jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table using Spark SQL JDBC 
Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do 
the same with HBase tables. We tried this using Hive and HiveServer2, but the 
response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBase? If so, the link 
below and several packages out there support that by having an HBase data source 
for Spark. There are some examples in that link of what the Spark code looks like. 
On that note, you should also be able to use the HBase data source from a pure SQL 
(Spark SQL) query as well, which should work in the case of the Spark SQL JDBC 
Thrift Server (with USING, 
http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
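As a very rough sketch of that route, here is what it might look like with one such package, the Hortonworks shc connector (the data source class and catalog format below are specific to that connector and are assumptions to verify against whichever HBase source you actually deploy):

// Hypothetical catalog for the "tsco" table used elsewhere in this thread,
// mapping the HBase row key and two qualifiers of the stock_daily family to columns.
val catalog =
  """{
    |  "table":   {"namespace": "default", "name": "tsco"},
    |  "rowkey":  "key",
    |  "columns": {
    |    "key":    {"cf": "rowkey",      "col": "key",    "type": "string"},
    |    "ticker": {"cf": "stock_daily", "col": "ticker", "type": "string"},
    |    "close":  {"cf": "stock_daily", "col": "close",  "type": "string"}
    |  }
    |}""".stripMargin

val hbaseDF = sqlContext.read
  .option("catalog", catalog)
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Registered (and optionally cached) like this, the table is queryable from the Spark SQL
// Thrift Server; the same data source class can also be named in a
// CREATE TABLE ... USING ... OPTIONS (...) statement, as in the PostgreSQL example below.
hbaseDF.registerTempTable("tsco")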


_____________________________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <user@spark.apache.org>


Felix,

The only alternative way is to create a stored procedure (a UDF, in database 
terms) that would run Spark Scala code underneath. That way, I could use the Spark 
SQL JDBC Thriftserver to execute it with SQL, passing the key/values I 
want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper 
table on top of an HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" 
<bbuil...@gmail.com> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know that in 
Hive we are able to create tables on top of an underlying HBase table that can 
be accessed using MapReduce jobs. Can the same be done using HiveContext or 
SQLContext? We are trying to set up a way to GET and POST data to and from the 
HBase table using the Spark SQL JDBC thriftserver from our RESTful API 
endpoints and/or HTTP web farms. If we can get this to work, then we can load 
balance the thriftservers. In addition, this will benefit us by giving us a way 
to abstract the data storage layer away from the presentation layer code. There 
is a chance that we will swap out the data storage technology in the future. We 
are currently experimenting with Kudu.

Thanks,
Ben