Hi Ayan,
It sounds like the row key has to be unique, much like a primary key in an RDBMS.
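Just to convince myself of that, here is a quick sketch with the HBase client API from the Scala shell. It is not tested end to end and assumes the new 'tsco' table described below (with the stock_daily family) already exists, but it shows the point: two Puts with the same row key end up as a single row, the later value simply masking the earlier one.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("tsco"))

// first Put for row key "1-Apr-08"
val p1 = new Put(Bytes.toBytes("1-Apr-08"))
p1.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("405.25"))
table.put(p1)

// second Put with the SAME row key: no new row is created, the cell just gets a
// newer version, so a scan still shows one row for 1-Apr-08 with close=999.99
val p2 = new Put(Bytes.toBytes("1-Apr-08"))
p2.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes("999.99"))
table.put(p2)

table.close()
connection.close()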
This is what I download as a CSV for a stock from Google Finance:
Date Open High Low Close Volume
27-Sep-16 177.4 177.75 172.5 177.75 24117196
So what I do is add the stock name and ticker myself to the end of each row via a
shell script, and get rid of the header:
sed -i 1d tsco.csv; cat tsco.csv | awk '{print $0,",TESCO PLC,TSCO"}' > new.csv
The new table has two column families, stock_daily and stock_info, and the row key
is the date (one row per date). The shell script creates a new CSV file with the
two additional columns appended to the end of each line.
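For what it is worth, the same preparation could be done with a small Spark program instead of sed/awk, along the lines Ayan suggested. A rough sketch, untested; the input path and the header layout of the raw download are assumptions on my part:

import org.apache.spark.sql.functions.lit

// read the raw Google Finance download (Date,Open,High,Low,Close,Volume plus a header row)
val raw = spark.read.option("header", "true").csv("hdfs://rhes564:9000/data/stocks/tsco_raw.csv")

// append the stock name and ticker as two extra columns; column order stays
// Date,Open,High,Low,Close,Volume,stock,ticker, matching the ImportTsv column list below
val prepared = raw.
  withColumn("stock", lit("TESCO PLC")).
  withColumn("ticker", lit("TSCO"))

// write it back out without a header so ImportTsv does not pick up the column names as data
prepared.write.option("header", "false").csv("hdfs://rhes564:9000/data/stocks/tsco_prepared")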
Then I run the following command:

$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume,stock_info:stock,stock_info:ticker" \
  tsco hdfs://rhes564:9000/data/stocks/tsco.csv
This is what the HBase table holds for a given day:
hbase(main):090:0> scan 'tsco', LIMIT => 10
ROW                 COLUMN+CELL
 1-Apr-08           column=stock_daily:close, timestamp=1475492248665, value=405.25
 1-Apr-08           column=stock_daily:high, timestamp=1475492248665, value=406.75
 1-Apr-08           column=stock_daily:low, timestamp=1475492248665, value=379.25
 1-Apr-08           column=stock_daily:open, timestamp=1475492248665, value=380.00
 1-Apr-08           column=stock_daily:volume, timestamp=1475492248665, value=49664486
 1-Apr-08           column=stock_info:stock, timestamp=1475492248665, value=TESCO PLC
 1-Apr-08           column=stock_info:ticker, timestamp=1475492248665, value=TSCO
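As a sanity check, a single day can also be fetched back with a straight Get from the Scala shell. A small sketch, with the same assumptions as the one further up:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("tsco"))

// fetch the row for one trading day and read a couple of cells back as strings
val result = table.get(new Get(Bytes.toBytes("1-Apr-08")))
val close  = Bytes.toString(result.getValue(Bytes.toBytes("stock_daily"), Bytes.toBytes("close")))
val ticker = Bytes.toString(result.getValue(Bytes.toBytes("stock_info"), Bytes.toBytes("ticker")))
println(s"close=$close, ticker=$ticker")

table.close()
connection.close()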
But I also have this at the bottom of the scan:
 Date               column=stock_daily:close, timestamp=1475491189158, value=Close
 Date               column=stock_daily:high, timestamp=1475491189158, value=High
 Date               column=stock_daily:low, timestamp=1475491189158, value=Low
 Date               column=stock_daily:open, timestamp=1475491189158, value=Open
 Date               column=stock_daily:volume, timestamp=1475491189158, value=Volume
 Date               column=stock_info:stock, timestamp=1475491189158, value=TESCO PLC
 Date               column=stock_info:ticker, timestamp=1475491189158, value=TSCO
It looks like the CSV header row got loaded as a row whose key is the literal "Date"?
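If that is what it is, I can either make sure the header is stripped from the file that actually lands on HDFS, or just delete that one stray row after the load. A quick sketch of the latter with the client API (same assumptions as above); the hbase shell's deleteall 'tsco', 'Date' should do the same thing.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("tsco"))

// remove the row whose key is the literal header value "Date"
table.delete(new Delete(Bytes.toBytes("Date")))

table.close()
connection.close()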
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On 3 October 2016 at 11:24, ayan guha <[email protected]> wrote:
> I am not well versed with importtsv, but you can create a CSV file using a
> simple Spark program that makes the first column ticker+tradedate. I remember
> doing a similar manipulation to create the row key format in Pig.
> On 3 Oct 2016 20:40, "Mich Talebzadeh" <[email protected]> wrote:
>
>> Thanks Ayan,
>>
>> How do you specify ticker+tradedate as the row key in the command below?
>>
>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>   -Dimporttsv.separator=',' \
>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>
>> I always thought that HBase would take the first column as the row key, so it
>> takes the stock name as the row key, which is Tesco PLC for every row!
>>
>> Does row key need to be unique?
>>
>> cheers
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 3 October 2016 at 10:30, ayan guha <[email protected]> wrote:
>>
>>> Hi Mich
>>>
>>> It is more to do with HBase than Spark.
>>>
>>> The row key can be anything, yes, but essentially what you are doing is
>>> inserting and updating the same Tesco PLC row over and over. Given your schema,
>>> ticker+tradedate seems to be a good row key.
>>> On 3 Oct 2016 18:25, "Mich Talebzadeh" <[email protected]>
>>> wrote:
>>>
>>>> thanks again.
>>>>
>>>> I added that jar file to the classpath and that part worked.
>>>>
>>>> I was using spark-shell, so I will have to use spark-submit for it to be able
>>>> to interact with the MapReduce job.
>>>>
>>>> BTW, when I use the command-line utility ImportTsv to load a file into
>>>> HBase, with the following table format:
>>>>
>>>> describe 'marketDataHbase'
>>>> Table marketDataHbase is ENABLED
>>>> marketDataHbase
>>>> COLUMN FAMILIES DESCRIPTION
>>>> {NAME => 'price_info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
>>>> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
>>>> COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536',
>>>> REPLICATION_SCOPE => '0'}
>>>> 1 row(s) in 0.0930 seconds
>>>>
>>>>
>>>> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>>>>   -Dimporttsv.separator=',' \
>>>>   -Dimporttsv.columns="HBASE_ROW_KEY,stock_daily:ticker,stock_daily:tradedate,stock_daily:open,stock_daily:high,stock_daily:low,stock_daily:close,stock_daily:volume" \
>>>>   tsco hdfs://rhes564:9000/data/stocks/tsco.csv
>>>>
>>>> There are 1200 rows in the CSV file, *but it only loads the first row!*
>>>>
>>>> scan 'tsco'
>>>> ROW                COLUMN+CELL
>>>>  Tesco PLC         column=stock_daily:close, timestamp=1475447365118, value=325.25
>>>>  Tesco PLC         column=stock_daily:high, timestamp=1475447365118, value=332.00
>>>>  Tesco PLC         column=stock_daily:low, timestamp=1475447365118, value=324.00
>>>>  Tesco PLC         column=stock_daily:open, timestamp=1475447365118, value=331.75
>>>>  Tesco PLC         column=stock_daily:ticker, timestamp=1475447365118, value=TSCO
>>>>  Tesco PLC         column=stock_daily:tradedate, timestamp=1475447365118, value= 3-Jan-06
>>>>  Tesco PLC         column=stock_daily:volume, timestamp=1475447365118, value=46935045
>>>> 1 row(s) in 0.0390 seconds
>>>>
>>>> Is this because the HBASE_ROW_KEY --> Tesco PLC is the same for every row? I
>>>> thought the row key could be anything.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 3 October 2016 at 07:44, Benjamin Kim <[email protected]> wrote:
>>>>
>>>>> We installed Apache Spark 1.6.0 alongside CDH 5.4.8 because Cloudera only
>>>>> shipped Spark 1.3.0 at the time, and we wanted to use Spark 1.6.0's features.
>>>>> We borrowed the /etc/spark/conf/spark-env.sh file that Cloudera generated,
>>>>> because it was customized to add jars first from the paths listed in
>>>>> /etc/spark/conf/classpath.txt. So we entered the path for the htrace jar into
>>>>> the /etc/spark/conf/classpath.txt file. Then it worked: we could read/write
>>>>> to HBase.
>>>>>
>>>>> On Oct 2, 2016, at 12:52 AM, Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Thanks Ben
>>>>>
>>>>> The thing is I am using Spark 2 and no stack from CDH!
>>>>>
>>>>> Is this approach to reading/writing to HBase specific to Cloudera?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 1 October 2016 at 23:39, Benjamin Kim <[email protected]> wrote:
>>>>>
>>>>>> Mich,
>>>>>>
>>>>>> I know up until CDH 5.4 we had to add the HTrace jar to the classpath to
>>>>>> make it work, using the command below. But after upgrading to CDH 5.7, it
>>>>>> became unnecessary.
>>>>>>
>>>>>> echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >> /etc/spark/conf/classpath.txt
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Oct 1, 2016, at 3:22 PM, Mich Talebzadeh <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> I am trying a bulk load using HFiles in Spark, as in the example below:
>>>>>>
>>>>>> import org.apache.spark._
>>>>>> import org.apache.spark.rdd.NewHadoopRDD
>>>>>> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
>>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin
>>>>>> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>>>>> import org.apache.hadoop.fs.Path;
>>>>>> import org.apache.hadoop.hbase.HColumnDescriptor
>>>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>>> import org.apache.hadoop.hbase.client.Put;
>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>> import org.apache.hadoop.hbase.mapred.TableOutputFormat
>>>>>> import org.apache.hadoop.mapred.JobConf
>>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>> import org.apache.hadoop.mapreduce.Job
>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
>>>>>> import org.apache.hadoop.hbase.KeyValue
>>>>>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
>>>>>> import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
>>>>>>
>>>>>> So far no issues.
>>>>>>
>>>>>> Then I do
>>>>>>
>>>>>> val conf = HBaseConfiguration.create()
>>>>>> conf: org.apache.hadoop.conf.Configuration = Configuration:
>>>>>> core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
>>>>>> yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml
>>>>>> val tableName = "testTable"
>>>>>> tableName: String = testTable
>>>>>>
>>>>>> But this one fails:
>>>>>>
>>>>>> scala> val table = new HTable(conf, tableName)
>>>>>> java.io.IOException: java.lang.reflect.InvocationTargetException
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:431)
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:302)
>>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:185)
>>>>>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:151)
>>>>>>   ... 52 elided
>>>>>> Caused by: java.lang.reflect.InvocationTargetException: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>>>>>>   ... 57 more
>>>>>> Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/Trace
>>>>>>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:216)
>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:419)
>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>>>>>>   at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:905)
>>>>>>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:648)
>>>>>>   ... 62 more
>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.htrace.Trace
>>>>>> I have got all the jar files in spark-defaults.conf
>>>>>>
>>>>>> spark.driver.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>> spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar:/home/hduser/jars/hbase-client-1.2.3.jar:/home/hduser/jars/hbase-server-1.2.3.jar:/home/hduser/jars/hbase-common-1.2.3.jar:/home/hduser/jars/hbase-protocol-1.2.3.jar:/home/hduser/jars/htrace-core-3.0.4.jar:/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>>
>>>>>>
>>>>>> and also in Spark shell where I test the code
>>>>>>
>>>>>> --jars /home/hduser/jars/hbase-client-1.2.3.jar,/home/hduser/jars/hbase-server-1.2.3.jar,/home/hduser/jars/hbase-common-1.2.3.jar,/home/hduser/jars/hbase-protocol-1.2.3.jar,/home/hduser/jars/htrace-core-3.0.4.jar,/home/hduser/jars/hive-hbase-handler-2.1.0.jar
>>>>>>
>>>>>> So any ideas will be appreciated.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>