Hello there,
I am wondering how to get the column family names and column qualifier names
when using PySpark to read an HBase table with multiple column families.
I have an HBase table as follows:
    hbase(main):007:0> scan 'data1'
    ROW      COLUMN+CELL
     row1    column=f1:, timestamp=1411078148186, value=value1
     row1    column=f2:, timestamp=1415732470877, value=value7
     row2    column=f2:, timestamp=1411078160265, value=value2
When I run the examples/hbase_inputformat.py code:
    # sc is the SparkContext (created by the example script or the pyspark shell)
    conf2 = {"hbase.zookeeper.quorum": "localhost",
             "hbase.mapreduce.inputtable": "data1"}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf2)
    # Bring every (row key, converted value) pair back to the driver and print it
    output = hbase_rdd.collect()
    for (k, v) in output:
        print((k, v))
I only see one value per row, with no column family or qualifier:
(u'row1', u'value1')
(u'row2', u'value2')
What I really want is (row_id, column family:column qualifier, value)
tuples. Any comments? Thanks!
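
To make the desired shape concrete, here is a rough sketch in plain Python. It does not touch HBase or Spark at all: the `cells` list is hard-coded by hand from the scan output above, and both its dict layout and the `to_tuples` helper are hypothetical names used only for illustration, not something the converter returns today.

```python
# Hypothetical input: one dict per HBase cell, copied by hand from the scan above.
# The qualifier is empty because the table stores values directly under "f1:" / "f2:".
cells = [
    {"row": "row1", "family": "f1", "qualifier": "", "value": "value1"},
    {"row": "row1", "family": "f2", "qualifier": "", "value": "value7"},
    {"row": "row2", "family": "f2", "qualifier": "", "value": "value2"},
]


def to_tuples(cells):
    """Turn each cell into a (row_id, 'family:qualifier', value) tuple."""
    return [
        (c["row"], c["family"] + ":" + c["qualifier"], c["value"])
        for c in cells
    ]


desired = to_tuples(cells)
for t in desired:
    print(t)
```

Note that this would give two tuples for row1 (one per column family), whereas the current output gives only a single value for that row.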
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.