Hello,
First of all, I'm new to Pig and NoSQL, so I hope you'll forgive any stupid
questions ;-)
I'm playing with OpenTSDB (a software layer on top of HBase for handling
time series data), and now I'd like to run some data mining queries on my
timestamped data. I found that Pig could be a solution, so I tried to make
it work on top of the OpenTSDB data in HBase. It nearly works, but I'm
still confused.
OpenTSDB schema:
hbase(main):011:0> describe 'tsdb-uid'
DESCRIPTION                                                          ENABLED
{NAME => 'tsdb-uid', FAMILIES => [{NAME => 'id', BLOOMFILTER => 'NONE',  true
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}, {NAME => 'name', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
hbase(main):012:0> describe 'tsdb'
DESCRIPTION                                                          ENABLED
{NAME => 'tsdb', FAMILIES => [{NAME => 't', BLOOMFILTER => 'NONE',       true
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
Here is some sample UID data:
hbase(main):014:0> scan 'tsdb-uid'
ROW COLUMN+CELL
\x00\x00\x01 column=name:metrics,
timestamp=1314801674803, value=proc.loadavg.1m
\x00\x00\x01 column=name:tagk,
timestamp=1314801684953, value=validity
\x00\x00\x01 column=name:tagv,
timestamp=1314801685000, value=true
\x00\x00\x02 column=name:metrics,
timestamp=1314801674849, value=proc.loadavg.5m
\x00\x00\x02 column=name:tagk,
timestamp=1314801685049, value=device
\x00\x00\x02 column=name:tagv,
timestamp=1314801685096, value=Device1
\x00\x00\x03 column=name:metrics,
timestamp=1314801674898, value=Measurement_1
\x00\x00\x03 column=name:tagk,
timestamp=1314801685144, value=accuracy
\x00\x00\x03 column=name:tagv,
timestamp=1314801693030, value=Device2
\x00\x00\x04 column=name:metrics,
timestamp=1314801674947, value=Measurement_2
\x00\x00\x05 column=name:metrics,
timestamp=1314801674994, value=Measurement_3
Device1 column=id:tagv,
timestamp=1314801685097, value=\x00\x00\x02
Device2 column=id:tagv,
timestamp=1314801693031, value=\x00\x00\x03
Measurement_1 column=id:metrics,
timestamp=1314801674899, value=\x00\x00\x03
Measurement_2 column=id:metrics,
timestamp=1314801674948, value=\x00\x00\x04
Measurement_3 column=id:metrics,
timestamp=1314801674995, value=\x00\x00\x05
accuracy column=id:tagk,
timestamp=1314801685145, value=\x00\x00\x03
device column=id:tagk,
timestamp=1314801685050, value=\x00\x00\x02
proc.loadavg.1m column=id:metrics,
timestamp=1314801674804, value=\x00\x00\x01
proc.loadavg.5m column=id:metrics,
timestamp=1314801674850, value=\x00\x00\x02
true column=id:tagv,
timestamp=1314801685002, value=\x00\x00\x01
validity column=id:tagk,
timestamp=1314801684955, value=\x00\x00\x01
This table holds the metrics (the types of timestamped data, stored under
id:metrics) and the tags that describe the data (tagk for the tag key and
tagv for the tag value, e.g. validity = true).
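To make clear what I expect to get back, here is a small Python sketch of the name-to-id mapping held in the id:metrics cells of the scan above. It is only an illustration and not part of my setup: the names METRIC_UIDS, uid_to_int and uid_to_hex are made up, and reading the 3-byte value as an unsigned big-endian integer is my assumption.

```python
# Name -> UID pairs copied from the id:metrics cells in the scan above.
METRIC_UIDS = {
    "Measurement_1": b"\x00\x00\x03",
    "Measurement_2": b"\x00\x00\x04",
    "Measurement_3": b"\x00\x00\x05",
    "proc.loadavg.1m": b"\x00\x00\x01",
    "proc.loadavg.5m": b"\x00\x00\x02",
}


def uid_to_int(uid: bytes) -> int:
    """Read a raw UID as an unsigned big-endian integer (assumed encoding)."""
    return int.from_bytes(uid, byteorder="big")


def uid_to_hex(uid: bytes) -> str:
    """Render a raw UID the way the HBase shell prints it, e.g. \\x00\\x00\\x03."""
    return "".join(f"\\x{b:02x}" for b in uid)


if __name__ == "__main__":
    # Print each metric with its UID in shell notation and as an integer.
    for name, uid in sorted(METRIC_UIDS.items()):
        print(name, uid_to_hex(uid), uid_to_int(uid))
```

Note that none of these id bytes are printable characters, so I print them in escaped form here rather than as raw text.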
From Pig, when I want to retrieve only the metrics and their values (i.e.
the ids used in the data table), I do:
tsd_metrics = LOAD 'hbase://tsdb-uid' using
org.apache.pig.backend.hadoop.hbase.HBaseStorage('id:metrics', '-loadKey
true') AS (metrics:bytearray);
dump tsd_metrics;
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.8.1-SNAPSHOT opentsdb 2011-09-06 13:39:27 2011-09-06 13:39:34 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0004 tsd_metrics MAP_ONLY file:/tmp/temp-1850282462/tmp1589556736,
Input(s):
Successfully read records from: "hbase://tsdb-uid"
Output(s):
Successfully stored records in: "file:/tmp/temp-1850282462/tmp1589556736"
Job DAG:
job_local_0004
(Measurement_1,)
(Measurement_2,)
(Measurement_3,)
(proc.loadavg.1m,)
(proc.loadavg.5m,)
So that's nearly OK, except that the value (= id) is displayed as null
instead of, for example, \x00\x00\x03 in the case of Measurement_1.
Any ideas?
Thanks!
shazz