Re: Accessing Hbase tables through Spark, this seems to work

2016-10-18 Thread Mich Talebzadeh
The design really needs to look at the other parts of the stack as well.

If the visualisation layer is going to be Tableau then you cannot use
Spark functional programming; only Spark SQL, or anything else that speaks
SQL such as Hive or Phoenix.

Tableau is not a real-time dashboard, so for analytics it maps the tables in
the database as it sees them. It has an ODBC/JDBC connection to Hive (I don't
know about Phoenix).

So that is the advantage of Hive. As for caching, yes, you can cache some data
in the Tableau Server cache, but we all agree that it is finite. The same is
true for anything that relies on memory: Hive + LLAP, or any in-memory database
(I tried Tableau on Oracle TimesTen). You can only cache a certain amount of
data, and no one is going to splash out on large amounts of memory just for
analytics.

Bear in mind that performance is a deployment issue and you are unlikely to
be able to create the same conditions as PROD in a test environment.



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 October 2016 at 08:18, Jörn Franke  wrote:

> Careful: HBase with Phoenix is faster only in certain scenarios, namely when
> you are processing a small amount of data out of a much bigger dataset (it
> depends on node memory, the operation, etc.). Hive+Tez+ORC can be rather
> competitive, and LLAP makes sense for interactive ad-hoc queries that are
> rather similar to each other. Phoenix and Hive serve different purposes, with
> different architectures and underlying data structures.
>
> On 18 Oct 2016, at 07:44, Mich Talebzadeh 
> wrote:
>
> yes Hive external table is partitioned on a daily basis (datestamp below)
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData (
>  KEY string
>, SECURITY string
>, TIMECREATED string
>, PRICE float
> )
> COMMENT 'From prices Kafka delivered by Flume location by day'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION 'hdfs://rhes564:9000/data/prices/'
> --TBLPROPERTIES ("skip.header.line.count"="1")
> ;
> ALTER TABLE ${DATABASE}.externalMarketData set location
> 'hdfs://rhes564:9000/data/prices/${TODAY}';
>
> and there is insert/overwrite into managed table every 15 minutes.
>
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp =
> "${TODAY}")
> SELECT
>   KEY
> , SECURITY
> , TIMECREATED
> , PRICE
> , 1
> , CAST(from_unixtime(unix_timestamp()) AS timestamp)
> FROM ${DATABASE}.externalMarketData
>
> That works fine. However, Hbase is much faster for data retrieval with
> phoenix
>
> When we get Hive with LLAP, I gather Hive will replace Hbase.
>
> So in summary we have
>
>
>1. raw data delivered to HDFS
>2. data ingested into Hbase via cron
>3. HDFS directory is mapped to Hive external table
>4. There is Hive managed table with added optimisation/indexing (ORC)
>
>
> There are a number of ways of doing it as usual.
>
> Thanks
>
>
>
> On 18 October 2016 at 00:48, ayan guha  wrote:
>
>> I do not see a rationale for having HBase in this scheme of things... maybe
>> I am missing something?
>>
>> If data is delivered in HDFS, why not just add partition to an existing
>> Hive table?
>>
>> On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Mike,
>>>
>>> My test csv data comes as
>>>
>>> UUID, ticker,  timecreated,
>>> price
>>> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
>>> 53.36665625650533484995
>>> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
>>> 86.31917515824627016510
>>> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
>>> 95.48298277703729129559
>>>
>>>
>>> And this is my Hbase table with one column family
>>>
>>> create 'marketDataHbase', 'price_info'
>>>
>>> It is populated every 15 minutes 

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-18 Thread Jörn Franke
Careful: HBase with Phoenix is faster only in certain scenarios, namely when you
are processing a small amount of data out of a much bigger dataset (it depends
on node memory, the operation, etc.). Hive+Tez+ORC can be rather competitive,
and LLAP makes sense for interactive ad-hoc queries that are rather similar to
each other. Phoenix and Hive serve different purposes, with different
architectures and underlying data structures.

> On 18 Oct 2016, at 07:44, Mich Talebzadeh  wrote:
> 
> yes Hive external table is partitioned on a daily basis (datestamp below)
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData (
>  KEY string
>, SECURITY string
>, TIMECREATED string
>, PRICE float
> )
> COMMENT 'From prices Kafka delivered by Flume location by day'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION 'hdfs://rhes564:9000/data/prices/'
> --TBLPROPERTIES ("skip.header.line.count"="1")
> ;
> ALTER TABLE ${DATABASE}.externalMarketData set location 
> 'hdfs://rhes564:9000/data/prices/${TODAY}';
> 
> and there is insert/overwrite into managed table every 15 minutes.
> 
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = 
> "${TODAY}")
> SELECT
>   KEY
> , SECURITY
> , TIMECREATED
> , PRICE
> , 1
> , CAST(from_unixtime(unix_timestamp()) AS timestamp)
> FROM ${DATABASE}.externalMarketData
> 
> That works fine. However, Hbase is much faster for data retrieval with phoenix
> 
> When we get Hive with LLAP, I gather Hive will replace Hbase.
> 
> So in summary we have
> 
> raw data delivered to HDFS
> data ingested into Hbase via cron
> HDFS directory is mapped to Hive external table
> There is Hive managed table with added optimisation/indexing (ORC)
> 
> There are a number of ways of doing it as usual.
> 
> Thanks
> 
> 
> 
>> On 18 October 2016 at 00:48, ayan guha  wrote:
>> I do not see a rationale for having HBase in this scheme of things... maybe I
>> am missing something?
>> 
>> If data is delivered in HDFS, why not just add partition to an existing Hive 
>> table? 
>> 
>>> On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh 
>>>  wrote:
>>> Thanks Mike,
>>>  
>>> My test csv data comes as
>>>  
>>> UUID, ticker,  timecreated,  price
>>> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,  
>>> 53.36665625650533484995
>>> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,  
>>> 86.31917515824627016510
>>> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,  
>>> 95.48298277703729129559
>>>  
>>>  
>>> And this is my Hbase table with one column family
>>>  
>>> create 'marketDataHbase', 'price_info'
>>>  
>>> It is populated every 15 minutes from test.csv files delivered via Kafka 
>>> and Flume to HDFS
>>>  
>>> Create a fat csv file based on all small csv files for today --> 
>>> prices/2016-10-17
>>> Populate data into Hbase table using 
>>> org.apache.hadoop.hbase.mapreduce.ImportTsv
>>> This is pretty quick using MapReduce
>>>  
>>> That importTsv only appends new rows to Hbase table as the choice of UUID 
>>> as rowKey avoids any issues.
>>>  
>>> So I only have 15 minutes lag in my batch Hbase table.
>>>  
>>> I have both Hive ORC tables and Phoenix views on top of this Hbase tables.
>>>  
>>> Phoenix offers the fastest response if used on top of Hbase. unfortunately, 
>>> Spark 2 with Phoenix is broken
>>> Spark on Hive on Hbase looks OK. This works fine with Spark 2
>>> Spark on Hbase tables directly using key, value DFs for each column. Not as 
>>> fast as 2 but works. I don't think a DF is a good choice for a key, value 
>>> pair?
>>> Now if I use Zeppelin to read from Hbase
>>>  
>>> I can use Phoenix JDBC. That looks very fast
>>> I can use Spark csv directly on HDFS csv files.
>>> I can use Spark on Hive tables
>>>  
>>> If I use Tableau on Hbase data then, only sql like code is useful. Phoenix 
>>> or Hive
>>>  
>>> I don't want to change the design now. But admittedly Hive is the best SQL 
>>> on top of Hbase. Next release of Hive is going to have in-memory database 
>>> (LLAP) so we can cache Hive tables in memory. That will be faster. Many 
>>> people underestimate Hive but I still believe it has a lot to offer besides 
>>> serious ANSI compliant SQL.
>>>  
>>> Regards
>>>  
>>>  Mich

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
Yes, the Hive external table is partitioned on a daily basis (DateStamp below):

CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData (
 KEY string
   , SECURITY string
   , TIMECREATED string
   , PRICE float
)
COMMENT 'From prices Kafka delivered by Flume location by day'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs://rhes564:9000/data/prices/'
--TBLPROPERTIES ("skip.header.line.count"="1")
;
ALTER TABLE ${DATABASE}.externalMarketData set location
'hdfs://rhes564:9000/data/prices/${TODAY}';

and there is an INSERT OVERWRITE into the managed table every 15 minutes:

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp =
"${TODAY}")
SELECT
  KEY
, SECURITY
, TIMECREATED
, PRICE
, 1
, CAST(from_unixtime(unix_timestamp()) AS timestamp)
FROM ${DATABASE}.externalMarketData

That works fine. However, HBase is much faster for data retrieval with
Phoenix.

When we get Hive with LLAP, I gather Hive will replace Hbase.

So in summary we have


   1. raw data delivered to HDFS
   2. data ingested into HBase via cron
   3. the HDFS directory is mapped to a Hive external table
   4. there is a Hive managed table with added optimisation/indexing (ORC); a
   possible DDL sketch for it follows this list
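
For reference, the managed ORC table's DDL is not shown anywhere in this
thread. A minimal sketch of what it might look like, inferred from the INSERT
OVERWRITE above (the names of the last two columns are hypothetical
placeholders for the literal 1 and the load timestamp in that SELECT):

CREATE TABLE IF NOT EXISTS ${DATABASE}.marketData (
     KEY string
   , SECURITY string
   , TIMECREATED string
   , PRICE float
   , OP_TYPE int          -- hypothetical name for the literal 1
   , OP_TIME timestamp    -- hypothetical name for the load timestamp
)
PARTITIONED BY (DateStamp string)
STORED AS ORC;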


There are a number of ways of doing it as usual.

Thanks






On 18 October 2016 at 00:48, ayan guha  wrote:

> I do not see a rationale for having HBase in this scheme of things... maybe
> I am missing something?
>
> If data is delivered in HDFS, why not just add partition to an existing
> Hive table?
>
> On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Mike,
>>
>> My test csv data comes as
>>
>> UUID, ticker,  timecreated,
>> price
>> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
>> 53.36665625650533484995
>> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
>> 86.31917515824627016510
>> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
>> 95.48298277703729129559
>>
>>
>> And this is my Hbase table with one column family
>>
>> create 'marketDataHbase', 'price_info'
>>
>> It is populated every 15 minutes from test.csv files delivered via Kafka
>> and Flume to HDFS
>>
>>
>>1. Create a fat csv file based on all small csv files for today -->
>>prices/2016-10-17
>>2. Populate data into Hbase table using 
>> org.apache.hadoop.hbase.mapreduce.ImportTsv
>>
>>3. This is pretty quick using MapReduce
>>
>>
>> That importTsv only appends new rows to Hbase table as the choice of UUID
>> as rowKey avoids any issues.
>>
>> So I only have 15 minutes lag in my batch Hbase table.
>>
>> I have both Hive ORC tables and Phoenix views on top of this Hbase
>> tables.
>>
>>
>>1. Phoenix offers the fastest response if used on top of Hbase.
>>unfortunately, Spark 2 with Phoenix is broken
>>2. Spark on Hive on Hbase looks OK. This works fine with Spark 2
>>3. Spark on Hbase tables directly using key, value DFs for each
>>column. Not as fast as 2 but works. I don't think a DF is a good choice 
>> for
>>a key, value pair?
>>
>> Now if I use Zeppelin to read from Hbase
>>
>>
>>1. I can use Phoenix JDBC. That looks very fast
>>2. I can use Spark csv directly on HDFS csv files.
>>3. I can use Spark on Hive tables
>>
>>
>> If I use Tableau on Hbase data then, only sql like code is useful.
>> Phoenix or Hive
>>
>> I don't want to change the design now. But admittedly Hive is the best
>> SQL on top of Hbase. Next release of Hive is going to have in-memory
>> database (LLAP) so we can cache Hive tables in memory. That will be faster.
>> Many people underestimate Hive but I still believe it has a lot to offer
>> besides serious ANSI compliant SQL.
>>
>> Regards
>>
>>  Mich

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread ayan guha
I do not see a rationale for having HBase in this scheme of things... maybe I
am missing something?

If data is delivered into HDFS, why not just add a partition to an existing
Hive table?
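
A minimal sketch of that suggestion, reusing the path layout and variables from
Mich's earlier DDL, and assuming the external table were declared
PARTITIONED BY (DateStamp string) rather than having its location re-pointed
each day:

ALTER TABLE ${DATABASE}.externalMarketData
  ADD IF NOT EXISTS PARTITION (DateStamp = '${TODAY}')
  LOCATION 'hdfs://rhes564:9000/data/prices/${TODAY}';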

On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh 
wrote:

> Thanks Mike,
>
> My test csv data comes as
>
> UUID, ticker,  timecreated,  price
> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
> 53.36665625650533484995
> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
> 86.31917515824627016510
> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
> 95.48298277703729129559
>
>
> And this is my Hbase table with one column family
>
> create 'marketDataHbase', 'price_info'
>
> It is populated every 15 minutes from test.csv files delivered via Kafka
> and Flume to HDFS
>
>
>1. Create a fat csv file based on all small csv files for today -->
>prices/2016-10-17
>2. Populate data into Hbase table using 
> org.apache.hadoop.hbase.mapreduce.ImportTsv
>
>3. This is pretty quick using MapReduce
>
>
> That importTsv only appends new rows to Hbase table as the choice of UUID
> as rowKey avoids any issues.
>
> So I only have 15 minutes lag in my batch Hbase table.
>
> I have both Hive ORC tables and Phoenix views on top of this Hbase tables.
>
>
>1. Phoenix offers the fastest response if used on top of Hbase.
>unfortunately, Spark 2 with Phoenix is broken
>2. Spark on Hive on Hbase looks OK. This works fine with Spark 2
>3. Spark on Hbase tables directly using key, value DFs for each
>column. Not as fast as 2 but works. I don't think a DF is a good choice for
>a key, value pair?
>
> Now if I use Zeppelin to read from Hbase
>
>
>1. I can use Phoenix JDBC. That looks very fast
>2. I can use Spark csv directly on HDFS csv files.
>3. I can use Spark on Hive tables
>
>
> If I use Tableau on Hbase data then, only sql like code is useful. Phoenix
> or Hive
>
> I don't want to change the design now. But admittedly Hive is the best SQL
> on top of Hbase. Next release of Hive is going to have in-memory database
> (LLAP) so we can cache Hive tables in memory. That will be faster. Many
> people underestimate Hive but I still believe it has a lot to offer besides
> serious ANSI compliant SQL.
>
> Regards
>
>  Mich
>
>
>
> On 17 October 2016 at 21:54, Michael Segel 
> wrote:
>
>> Mitch,
>>
>> Short answer… no, it doesn’t scale.
>>
>> Longer answer…
>>
>> You are using a UUID as the row key?  Why?  (My guess is that you want
>> to avoid hot spotting)
>>
>> So you’re going to have to pull in all of the data… meaning a full table
>> scan… and then perform a sort order transformation, dropping the UUID in
>> the process.
>>
>> You would be better off not using HBase and storing the data in Parquet
>> files in a directory partitioned on date.  Or rather the rowkey would be
>> the max_ts - TS so that your data is in LIFO.
>> Note: I’ve used the term epoch to describe the max value of a long (8
>> bytes of ‘FF’ ) for the max_ts. This isn’t a good use of the term epoch,
>> but if anyone has a better term, please let me know.
>>
>>
>>
>> Having said that… if you want to use HBase, you could do the same thing.
>> If you want to avoid hot spotting, you could load the day’s transactions
>> using a bulk loader so that you don’t have to worry about splits.
>>
>> But that’s just my $0.02 cents worth.
>>
>> HTH
>>
>> -Mike
>>
>> PS. If you wanted to capture the transactions… you could do the following
>> schema:
>>
>> 1) Rowkey = max_ts - TS
>> 2) Rows contain the following:
>> CUSIP (Transaction ID)
>> Party 1 (Seller)
>> Party 2 (Buyer)
>> Symbol
>> Qty
>> Price
>>
>> This is a trade ticket.
>>
>>
>>
>> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I have trade data stored in Hbase table. Data arrives in csv format to
>> HDFS and then loaded into Hbase via periodic load with
>> org.apache.hadoop.hbase.mapreduce.ImportTsv.
>>
>> The Hbase table has one Column family "trade_info" and three columns:
>> ticker, timecreated, price.
>>
>> The RowKey is UUID. So each row has UUID, ticker, timecreated and price
>> in the csv file
>>
>> Each row in Hbase is a key, value map. In my case, I have one 

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
Thanks Mike,

My test csv data comes as

UUID, ticker,  timecreated,  price
a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
53.36665625650533484995
a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
86.31917515824627016510
5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
95.48298277703729129559


And this is my Hbase table with one column family

create 'marketDataHbase', 'price_info'

It is populated every 15 minutes from test.csv files delivered via Kafka
and Flume to HDFS


   1. Create a fat csv file based on all the small csv files for today -->
   prices/2016-10-17
   2. Populate data into the HBase table using
   org.apache.hadoop.hbase.mapreduce.ImportTsv (a command-line sketch follows
   this list)
   3. This is pretty quick using MapReduce
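
A minimal sketch of what that ImportTsv invocation might look like for this
table, using a comma separator and the price_info columns above (the flags are
standard ImportTsv options; the script wrapping and cron scheduling are not
shown in the thread):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  '-Dimporttsv.separator=,' \
  -Dimporttsv.columns=HBASE_ROW_KEY,price_info:ticker,price_info:timecreated,price_info:price \
  marketDataHbase hdfs://rhes564:9000/data/prices/2016-10-17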


ImportTsv only appends new rows to the HBase table, as the choice of a UUID
rowkey avoids any key collisions.

So I only have a 15-minute lag in my batch HBase table.

I have both Hive ORC tables and Phoenix views on top of this HBase table.
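
A minimal sketch of such a Phoenix view mapped onto the existing HBase table
(standard Phoenix syntax for an existing table; exposing every column as
VARCHAR is an assumption):

CREATE VIEW "marketDataHbase" (
    pk VARCHAR PRIMARY KEY
  , "price_info"."ticker" VARCHAR
  , "price_info"."timecreated" VARCHAR
  , "price_info"."price" VARCHAR
);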


   1. Phoenix offers the fastest response if used on top of HBase.
   Unfortunately, Spark 2 with Phoenix is broken.
   2. Spark on Hive on HBase looks OK. This works fine with Spark 2 (a possible
   Hive-over-HBase DDL sketch follows this list).
   3. Spark on HBase tables directly, using key, value DataFrames for each
   column. Not as fast as option 2, but it works. I am not sure a DataFrame is
   a good fit for a key, value pair, though.
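
The Hive-on-HBase layer in option 2 is not shown in the thread. A minimal
sketch of what such a table might look like, assuming the standard
HBaseStorageHandler mapping onto the price_info column family (the Hive table
name is hypothetical):

CREATE EXTERNAL TABLE hive_marketDataHbase (
    key string
  , ticker string
  , timecreated string
  , price string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,price_info:ticker,price_info:timecreated,price_info:price"
)
TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");

Spark 2 built with Hive support can then read it via spark.table or plain SQL.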

Now if I use Zeppelin to read from Hbase


   1. I can use Phoenix JDBC. That looks very fast (a sketch follows this list).
   2. I can use Spark csv directly on the HDFS csv files.
   3. I can use Spark on Hive tables.
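
A minimal sketch of that Phoenix JDBC access from a Zeppelin (Scala) paragraph,
assuming the Phoenix client jar is on the classpath, that ZooKeeper runs on
rhes564:2181 (the quorum is an assumption), and that the view above exists:

import java.sql.DriverManager

// Phoenix JDBC URL is jdbc:phoenix:<zookeeper quorum>[:port]
val conn = DriverManager.getConnection("jdbc:phoenix:rhes564:2181")
val stmt = conn.createStatement()
// Query the Phoenix view mapped onto the HBase table
val rs = stmt.executeQuery(
  """SELECT "ticker", "timecreated", "price" FROM "marketDataHbase" LIMIT 10""")
while (rs.next())
  println(s"${rs.getString("ticker")} ${rs.getString("timecreated")} ${rs.getString("price")}")
rs.close(); stmt.close(); conn.close()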


If I use Tableau on the HBase data, then only SQL-like access is useful:
Phoenix or Hive.

I don't want to change the design now, but admittedly Hive is the best SQL
layer on top of HBase. The next release of Hive is going to have an in-memory
layer (LLAP), so we can cache Hive tables in memory; that will be faster. Many
people underestimate Hive, but I still believe it has a lot to offer besides
serious ANSI-compliant SQL.

Regards

 Mich



On 17 October 2016 at 21:54, Michael Segel 
wrote:

> Mitch,
>
> Short answer… no, it doesn’t scale.
>
> Longer answer…
>
> You are using a UUID as the row key?  Why?  (My guess is that you want to
> avoid hot spotting)
>
> So you’re going to have to pull in all of the data… meaning a full table
> scan… and then perform a sort order transformation, dropping the UUID in
> the process.
>
> You would be better off not using HBase and storing the data in Parquet
> files in a directory partitioned on date.  Or rather the rowkey would be
> the max_ts - TS so that your data is in LIFO.
> Note: I’ve used the term epoch to describe the max value of a long (8
> bytes of ‘FF’ ) for the max_ts. This isn’t a good use of the term epoch,
> but if anyone has a better term, please let me know.
>
>
>
> Having said that… if you want to use HBase, you could do the same thing.
> If you want to avoid hot spotting, you could load the day’s transactions
> using a bulk loader so that you don’t have to worry about splits.
>
> But that’s just my $0.02 cents worth.
>
> HTH
>
> -Mike
>
> PS. If you wanted to capture the transactions… you could do the following
> schema:
>
> 1) Rowkey = max_ts - TS
> 2) Rows contain the following:
> CUSIP (Transaction ID)
> Party 1 (Seller)
> Party 2 (Buyer)
> Symbol
> Qty
> Price
>
> This is a trade ticket.
>
>
>
> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I have trade data stored in Hbase table. Data arrives in csv format to
> HDFS and then loaded into Hbase via periodic load with
> org.apache.hadoop.hbase.mapreduce.ImportTsv.
>
> The Hbase table has one Column family "trade_info" and three columns:
> ticker, timecreated, price.
>
> The RowKey is UUID. So each row has UUID, ticker, timecreated and price in
> the csv file
>
> Each row in Hbase is a key, value map. In my case, I have one Column
> Family and three columns. Without going into semantics I see Hbase as a
> column oriented database where column data stay together.
>
> So I thought of this way of accessing the data.
>
> I define an RDD for each column in the column family as below. In this
> case column trade_info:ticker
>
> //create rdd
> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
> classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
> classOf[org.apache.hadoop.hbase.client.Result])

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Michael Segel
Mitch,

Short answer… no, it doesn’t scale.

Longer answer…

You are using a UUID as the row key?  Why?  (My guess is that you want to
avoid hot spotting.)

So you’re going to have to pull in all of the data… meaning a full table scan… 
and then perform a sort order transformation, dropping the UUID in the process.

You would be better off not using HBase and storing the data in Parquet files
in a directory partitioned on date. Or, if you stay with HBase, make the rowkey
max_ts - TS so that your data is stored in LIFO order.
Note: I've used the term epoch to describe the max value of a long (8 bytes of
'FF') for the max_ts. This isn't a good use of the term epoch, but if anyone
has a better term, please let me know.



Having said that… if you want to use HBase, you could do the same thing.  If 
you want to avoid hot spotting, you could load the day’s transactions using a 
bulk loader so that you don’t have to worry about splits.

But that’s just my $0.02 cents worth.

HTH

-Mike

PS. If you wanted to capture the transactions… you could do the following 
schema:

1) Rowkey = max_ts - TS
2) Rows contain the following:
CUSIP (Transaction ID)
Party 1 (Seller)
Party 2 (Buyer)
Symbol
Qty
Price

This is a trade ticket.
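
A minimal Scala sketch of that reversed-timestamp rowkey, assuming HBase's
Bytes utility and epoch-millisecond trade timestamps (the helper name is
hypothetical):

import org.apache.hadoop.hbase.util.Bytes

// Rowkey = max_ts - TS: newer trades get smaller keys, so the table reads back
// in LIFO order. Long.MaxValue plays the role of the "epoch"/max_ts above.
def lifoRowKey(tradeTimeMillis: Long): Array[Byte] =
  Bytes.toBytes(Long.MaxValue - tradeTimeMillis)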



On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh 
> wrote:

Hi,

I have trade data stored in an HBase table. Data arrives in csv format on HDFS
and is then loaded into HBase via a periodic load with
org.apache.hadoop.hbase.mapreduce.ImportTsv.

The HBase table has one column family "trade_info" and three columns: ticker,
timecreated, price.

The rowkey is a UUID, so each row in the csv file has a UUID, ticker,
timecreated and price.

Each row in HBase is a key, value map. In my case, I have one column family and
three columns. Without going into semantics, I see HBase as a column-oriented
database where column data stays together.

So I thought of this way of accessing the data.

I define an RDD for each column in the column family, as below. In this case,
the column trade_info:ticker:

// Imports assumed by this snippet
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import scala.collection.JavaConverters._

// HBase configuration pointing the input format at the table
// (table name assumed; the original message does not show how conf was built)
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "marketDataHbase")

// Create an RDD of (rowkey, Result) pairs over the whole table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// For each row keep the rowkey and the latest version of price_info:ticker
val rdd1 = hBaseRDD.map(_._2)
  .map(result => (result.getRow,
    result.getColumn("price_info".getBytes(), "ticker".getBytes())))
  .map(row => {
    (
      row._1.map(_.toChar).mkString,            // rowkey as a String
      row._2.asScala.reduceLeft {               // newest cell version wins
        (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
      }.getValue.map(_.toChar).mkString
    )
  })
case class columns(key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString)) // needs sqlContext.implicits._ in scope for toDF

Note that the end result is a DataFrame with the RowKey -> key and column -> 
ticker

I use the same approach to create two other DataFrames, namely dftimecreated 
and dfprice for the two other columns.
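
A minimal sketch of how that same approach might be factored into a helper (the
function name is hypothetical; it reuses hBaseRDD, JavaConverters and the toDF
implicits from the snippet above):

// One (key, value) DataFrame per column, keeping the newest cell version per row
def columnDF(family: String, qualifier: String, colName: String) =
  hBaseRDD.map(_._2).map { result =>
    val cells = result.getColumn(family.getBytes(), qualifier.getBytes()).asScala
    val latest = cells.reduceLeft((a, b) => if (a.getTimestamp > b.getTimestamp) a else b)
    (result.getRow.map(_.toChar).mkString, latest.getValue.map(_.toChar).mkString)
  }.toDF("key", colName)

val dftimecreated = columnDF("price_info", "timecreated", "timecreated")
val dfprice       = columnDF("price_info", "price", "price")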

Note that if I don't need a column, then I do not create a DF for it. So a DF 
with each column I use. I am not sure how this compares if I read the full row 
through other methods if any.

Anyway, all I need to do after creating a DataFrame for each column is to join
them through the rowkey to slice and dice the data, like below.

Get me the latest prices ordered by timecreated and ticker (ticker is the stock):

val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key")
  .orderBy('timecreated desc, 'price desc)
  .select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Is this a workable solution?


Thanks

