That sounds interesting, would love to learn more about it.

Mich: looks good. Lastly, I would suggest you consider whether you really
need multiple column families.
On 4 Oct 2016 02:57, "Benjamin Kim" <> wrote:

> Lately, I’ve been experimenting with Kudu. It has been a much better
> experience than with HBase. Using it is much simpler, even from spark-shell.
> spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
> It’s like going back to rudimentary DB systems where tables have just a
> primary key and the columns. Additional benefits include a home-grown spark
> package, fast upserts and table scans for analytics, time-series support
> just introduced, and (my favorite) simpler configuration and
> administration. It has just gone to version 1.0.0, so I’m waiting for
> 1.0.1+, to give some bugs time to shake out, before I propose it as our
> HBase replacement. All my performance tests have been stellar versus HBase,
> especially given its simplicity.
> Just a thought…
> Cheers,
> Ben
> On Oct 3, 2016, at 8:40 AM, Mich Talebzadeh <>
> wrote:
> Hi,
> I decided to create a composite key *ticker-date* from the csv file
> I just did some manipulation on CSV file
> export IFS=",";sed -i 1d tsco.csv; cat tsco.csv | while read a b c d e f;
> do echo "TSCO-$a,TESCO PLC,TSCO,$a,$b,$c,$d,$e,$f"; done > temp; mv -f temp
> tsco.csv
> This basically takes the csv file, tells the shell that the field separator
> is IFS=",", drops the header, reads every field in every line (a,b,c ..),
> creates the composite key TSCO-$a, and adds the stock name and ticker to the
> csv file. The whole process can be automated and parameterised.
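The parameterised version mentioned in the quoted message could be sketched roughly as follows. This is only an illustration, not the script Mich ran: the function name `add_row_key` is hypothetical, and it assumes the input has a one-line header followed by Date,Open,High,Low,Close,Volume columns. It sets IFS only for the `read` itself, rather than exporting it for the whole shell, and writes to stdout instead of editing the file in place.

```shell
#!/bin/sh
# Build a composite row key (ticker-date) as the first CSV field.
# Assumes a one-line header and the columns Date,Open,High,Low,Close,Volume.
add_row_key() {
    # $1 = ticker, $2 = stock name, $3 = input CSV; result goes to stdout
    ticker=$1
    name=$2
    infile=$3
    # tail -n +2 skips the header; IFS is changed only for the read itself
    tail -n +2 "$infile" | while IFS=, read -r d o h l c v; do
        printf '%s-%s,%s,%s,%s,%s,%s,%s,%s,%s\n' \
            "$ticker" "$d" "$name" "$ticker" "$d" "$o" "$h" "$l" "$c" "$v"
    done
}

# Demo on a throwaway file (values taken from the scan output in this thread)
sample=$(mktemp)
printf 'Date,Open,High,Low,Close,Volume\n' > "$sample"
printf '1-Apr-08,380.00,406.75,379.25,405.25,49664486\n' >> "$sample"
add_row_key TSCO "TESCO PLC" "$sample"
rm -f "$sample"
```

The demo prints `TSCO-1-Apr-08,TESCO PLC,TSCO,1-Apr-08,380.00,406.75,379.25,405.25,49664486`, which has the same field order as the ImportTsv column mapping quoted in this message.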
> Once the csv file is put into HDFS then, I run the following command
> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>   -Dimporttsv.separator=',' \
>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_info:stock,stock_info:ticker,stock_daily:Date,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
> The Hbase table is created as below
> create 'tsco','stock_info','stock_daily'
> and this is the data (2 rows, each with 2 column families and 8 attributes)
> hbase(main):132:0> scan 'tsco', LIMIT => 2
> ROW               COLUMN+CELL
>  TSCO-1-Apr-08    column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-08
>  TSCO-1-Apr-08    column=stock_daily:close, timestamp=1475507091676, value=405.25
>  TSCO-1-Apr-08    column=stock_daily:high, timestamp=1475507091676, value=406.75
>  TSCO-1-Apr-08    column=stock_daily:low, timestamp=1475507091676, value=379.25
>  TSCO-1-Apr-08    column=stock_daily:open, timestamp=1475507091676, value=380.00
>  TSCO-1-Apr-08    column=stock_daily:volume, timestamp=1475507091676, value=49664486
>  TSCO-1-Apr-08    column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>  TSCO-1-Apr-08    column=stock_info:ticker, timestamp=1475507091676, value=TSCO
>  TSCO-1-Apr-09    column=stock_daily:Date, timestamp=1475507091676, value=1-Apr-09
>  TSCO-1-Apr-09    column=stock_daily:close, timestamp=1475507091676, value=333.30
>  TSCO-1-Apr-09    column=stock_daily:high, timestamp=1475507091676, value=334.60
>  TSCO-1-Apr-09    column=stock_daily:low, timestamp=1475507091676, value=326.50
>  TSCO-1-Apr-09    column=stock_daily:open, timestamp=1475507091676, value=331.10
>  TSCO-1-Apr-09    column=stock_daily:volume, timestamp=1475507091676, value=24877341
>  TSCO-1-Apr-09    column=stock_info:stock, timestamp=1475507091676, value=TESCO PLC
>  TSCO-1-Apr-09    column=stock_info:ticker, timestamp=1475507091676, value=TSCO
> Any suggestions?
> Thanks
> Dr Mich Talebzadeh
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> On 3 October 2016 at 14:42, Mich Talebzadeh <>
> wrote:
>> Or maybe combine ticker+date, something like this:
>> <image.png>
>> So the new row key would be TSCO-1-Apr-08, and this will be added as the
>> row key. Will both Date and ticker stay as they are, as column family
>> attributes?
>> Dr Mich Talebzadeh
>> On 3 October 2016 at 14:32, Mich Talebzadeh <>
>> wrote:
>>> With ticker+date I can create something like the below for the row key:
>>> TSCO_1-Apr-08
>>> or TSCO1-Apr-08
>>> if I understood you correctly.
>>> Dr Mich Talebzadeh
>>> On 3 October 2016 at 13:13, ayan guha <> wrote:
>>>> Hi
>>>> Looks like you are saving to new.csv but still loading tsco.csv? It's
>>>> definitely the header.
>>>> Suggestion: ticker+date as row key has the following benefits:
>>>> 1. Using ticker+date as row key will enable you to hold multiple tickers
>>>> in this single HBase table. (Think composite primary key.)
>>>> 2. Using the date itself as row key will lead to hotspots (look up
>>>> hotspotting due to monotonically increasing row keys). To distribute the
>>>> load, it is suggested to use salting. The ticker can be used as a natural
>>>> salt in this case.
>>>> 3. Also, you may want to hash the row key value to make it a little more
>>>> flexible. (Think surrogate key.)
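As a rough illustration of the salting and hashing ideas in the quoted list, the sketch below prefixes a row key with a small bucket number derived from a checksum of the key. The bucket count, the `salted_key` name and the `NN-key` layout are all assumptions made for this example, not anything HBase prescribes. It also swaps in a hash-derived salt where the quoted message suggests using the ticker itself as a natural salt.

```shell
#!/bin/sh
# Prefix a row key with a two-digit salt bucket derived from a checksum.
# BUCKETS and the key layout are illustrative assumptions, not HBase settings.
BUCKETS=16

salted_key() {
    key=$1
    # cksum prints "<crc> <byte count>" for stdin; keep only the CRC
    crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    # Map the CRC onto one of BUCKETS prefixes, zero-padded to two digits
    printf '%02d-%s\n' $((crc % BUCKETS)) "$key"
}

salted_key "TSCO-1-Apr-08"
salted_key "TSCO-1-Apr-09"
```

The same function has to be applied when reading a key back. A salt computed from the whole key also spreads one ticker's dates across buckets, so a scan over a single ticker would have to visit every bucket. That is one reason the plain ticker prefix may be the simpler choice here.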
>>>> On Mon, Oct 3, 2016 at 10:17 PM, Mich Talebzadeh <> wrote:
>>>>> Hi Ayan,
>>>>> Sounds like the row key has to be unique much like a primary key in
>>>>> RDBMS
>>>>> This is what I download as a csv for stock from Google Finance
>>>>>   Date Open High Low Close Volume
>>>>> 27-Sep-16 177.4 177.75 172.5 177.75 24117196
>>>>> So what I do is add the stock and ticker myself to the end of each row via
>>>>> a shell script and get rid of the header:
>>>>> sed -i 1d tsco.csv; cat tsco.csv|awk '{print $0,",TESCO PLC,TSCO"}' >
>>>>> new.csv
>>>>> This creates a new csv file with two additional columns appended to
>>>>> the end of each line.
>>>>> The new table has two column families, stock_daily and stock_info, and
>>>>> the date as row key (one row per date).
>>>>> Then I run the following command
>>>>> $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
>>>>> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,
>>>>> stock_daily:open, stock_daily:high, stock_daily:low, stock_daily:close,
>>>>> stock_daily:volume, stock_info:stock, stock_info:ticker" tsco
>>>>> hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>> This is in Hbase table for a given day
>>>>> hbase(main):090:0> scan 'tsco', LIMIT => 10
>>>>> ROW          COLUMN+CELL
>>>>>  1-Apr-08    column=stock_daily:close, timestamp=1475492248665, value=405.25
>>>>>  1-Apr-08    column=stock_daily:high, timestamp=1475492248665, value=406.75
>>>>>  1-Apr-08    column=stock_daily:low, timestamp=1475492248665, value=379.25
>>>>>  1-Apr-08    column=stock_daily:open, timestamp=1475492248665, value=380.00
>>>>>  1-Apr-08    column=stock_daily:volume, timestamp=1475492248665, value=49664486
>>>>>  1-Apr-08    column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
>>>>>  1-Apr-08    column=stock_info:ticker, timestamp=1475492248665, value=TSCO
>>>>> But I also have this at the bottom
>>>>>  Date        column=stock_daily:close, timestamp=1475491189158, value=Close
>>>>>  Date        column=stock_daily:high, timestamp=1475491189158, value=High
>>>>>  Date        column=stock_daily:low, timestamp=1475491189158, value=Low
>>>>>  Date        column=stock_daily:open, timestamp=1475491189158, value=Open
>>>>>  Date        column=stock_daily:volume, timestamp=1475491189158, value=Volume
>>>>>  Date        column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
>>>>>  Date        column=stock_info:ticker, timestamp=1475491189158, value=TSCO
>>>>> Sounds like the table header?
>>>>> Dr Mich Talebzadeh
>>>>> On 3 October 2016 at 11:24, ayan guha <> wrote:
>>>>>> I am not well versed with importtsv, but you can create a CSV file
>>>>>> using a simple Spark program to create the first column as
>>>>>> ticker+tradedate. I remember doing similar manipulation to create a row
>>>>>> key format in Pig.
>>>>>> On 3 Oct 2016 20:40, "Mich Talebzadeh" <>
>>>>>> wrote:
>>>>>>> Thanks Ayan,
>>>>>>> How do you specify ticker+tradedate as row key in the below?
>>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>>>>>>   -Dimporttsv.separator=',' \
>>>>>>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>>>>>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>> I always thought that Hbase will take the first column as row key so
>>>>>>> it takes stock as the row key which is tsco plc for every row!
>>>>>>> Does row key need to be unique?
>>>>>>> cheers
>>>>>>> Dr Mich Talebzadeh
>>>>>>> On 3 October 2016 at 10:30, ayan guha <> wrote:
>>>>>>>> Hi Mich
>>>>>>>> It has more to do with HBase than Spark.
>>>>>>>> A row key can be anything, yes, but what you are effectively doing is
>>>>>>>> inserting and then repeatedly updating the single Tesco PLC row. Given
>>>>>>>> your schema, ticker+trade date seems to be a good row key.
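One quick way to see why only one row survives is to compare the number of lines in the CSV with the number of distinct values in its first field, which is the field the quoted ImportTsv command maps to HBASE_ROW_KEY. The sketch below is illustrative only. The `count_keys` helper is hypothetical, and the second sample line uses made-up values.

```shell
#!/bin/sh
# Compare total lines with distinct values of the first CSV field
# (the field mapped to HBASE_ROW_KEY in the quoted ImportTsv command).
count_keys() {
    total=$(wc -l < "$1" | tr -d ' ')
    distinct=$(cut -d, -f1 "$1" | sort -u | wc -l | tr -d ' ')
    echo "lines=$total distinct_keys=$distinct"
}

sample=$(mktemp)
# First line mirrors the scan output in this thread; the second is made up.
printf 'Tesco PLC,TSCO,3-Jan-06,331.75,332.00,324.00,325.25,46935045\n' > "$sample"
printf 'Tesco PLC,TSCO,4-Jan-06,325.00,326.00,320.00,321.00,100\n' >> "$sample"
count_keys "$sample"
rm -f "$sample"
```

The demo prints `lines=2 distinct_keys=1`. Every line carries the same key, so each write lands on the same row and, with VERSIONS => '1', only one set of values remains visible.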
>>>>>>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <>
>>>>>>>> wrote:
>>>>>>>>> thanks again.
>>>>>>>>> I added that jar file to the classpath and that part worked.
>>>>>>>>> I was using spark shell so I have to use spark-submit for it to be
>>>>>>>>> able to interact with map-reduce job.
>>>>>>>>> BTW when I use the command line utility ImportTsv  to load a file
>>>>>>>>> into Hbase with the following table format
>>>>>>>>> describe 'marketDataHbase'
>>>>>>>>> Table marketDataHbase is ENABLED
>>>>>>>>> marketDataHbase
>>>>>>>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1',
>>>>>>>>> IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE',
>>>>>>>>> DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE',
>>>>>>>>> MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536',
>>>>>>>>> REPLICATION_SCOPE => '0'}
>>>>>>>>> 1 row(s) in 0.0930 seconds
>>>>>>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>>>>>>>>   -Dimporttsv.separator=',' \
>>>>>>>>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>>>>>>>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>>>>>> There are 1200 rows in the csv file, *but it only loads the
>>>>>>>>> first row!*
>>>>>>>>> scan 'tsco'
>>>>>>>>> ROW           COLUMN+CELL
>>>>>>>>>  Tesco PLC    column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>>>>>>>  Tesco PLC    column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>>>>>>>  Tesco PLC    column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>>>>>>>  Tesco PLC    column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>>>>>>>  Tesco PLC    column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>>>>>>>  Tesco PLC    column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>>>>>>>  Tesco PLC    column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>>>>>>>> 1 row(s) in 0.0390 seconds
>>>>>>>>> Is this because the hbase_row_key --> Tesco PLC is the same for
>>>>>>>>> all? I thought that the row key can be anything.
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>> On 3 October 2016 at 07:44, Benjamin Kim <>
>>>>>>>>> wrote:
>>>>>>>>>> We installed Apache Spark 1.6.0 alongside CDH 5.4.8 because Cloudera
>>>>>>>>>> only had Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0’s
>>>>>>>>>> features. We borrowed the /etc/spark/conf/ file that Cloudera
>>>>>>>>>> generated because it was customized to add jars first from paths
>>>>>>>>>> listed in the file /etc/spark/conf/classpath.txt. So, we entered the
>>>>>>>>>> path for the htrace jar into the /etc/spark/conf/classpath.txt file.
>>>>>>>>>> Then, it worked. We could read/write to HBase.
>>>>>>>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <> wrote:
>>>>>>>>>> Thanks Ben
>>>>>>>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>>>>>>> Is this approach to reading/writing to Hbase specific to Cloudera?
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>> On 1 October 2016 at 23:39, Benjamin Kim <>
>>>>>>>>>> wrote:
>>>>>>>>>>> Mich,
>>>>>>>>>>> I know up until CDH 5.4 we had to add the HTrace jar to the
>>>>>>>>>>> classpath to make it work using the command below. But after
>>>>>>>>>>> upgrading to CDH 5.7, it became unnecessary.
>>>>>>>>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>>>>>>>>>>> Hope this helps.
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <> wrote:
>>>>>>>>>>> Trying bulk load using Hfiles in Spark as below example:
>>>>>>>>>>> import org.apache.spark._
>>>>>>>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>>>>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration,
>>>>>>>>>>> HTableDescriptor}
>>>>>>>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>>>>>>>> import org.apache.hadoop.hbase.client.Put;
>>>>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>>>>>>>> import org.apache.hadoop.mapred.JobConf
>>>>>>>>>>> import
>>>>>>>>>>> import org.apache.hadoop.mapreduce.Job
>>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>>>>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>>>>>>>> So far no issues.
>>>>>>>>>>> Then I do
>>>>>>>>>>> val conf = HBaseConfiguration.create()
>>>>>>>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration:
>>>>>>>>>>> core-default.xml, core-site.xml, mapred-default.xml, 
>>>>>>>>>>> mapred-site.xml,
>>>>>>>>>>> yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>>>>>>>> val tableName = "testTable"
>>>>>>>>>>> tableName: String = testTable
>>>>>>>>>>> ...
