Hi there,
We have hit a possible issue with Pig 0.9.1 and HBaseStorage when we LOAD
several row-key ranges from the same HBase table and UNION them. Here is a
simple example that reproduces the problem:
HBase Data (use hbase shell to create table and add rows):
create 'test', {NAME => 'data', VERSIONS => 1}
put 'test', '11111', 'data:value', '1'
put 'test', '11112', 'data:value', '2'
put 'test', '11113', 'data:value', '3'
put 'test', '22221', 'data:value', '4'
put 'test', '22222', 'data:value', '5'
put 'test', '22223', 'data:value', '6'
Pig Statements (create file test.pig):
load1 = LOAD 'hbase://test'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'data:*', '-loadKey -gte 11110 -lte 22220')
    AS (key:chararray, map:map[]);
load2 = LOAD 'hbase://test'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'data:*', '-loadKey -gte 22220 -lte 33330')
    AS (key:chararray, map:map[]);
result = UNION load1, load2;
dump result;
Run Script:
pig -x local test.pig
Actual Result:
(11111,[value#1])
(11112,[value#2])
(11113,[value#3])
(11111,[value#1])
(11112,[value#2])
(11113,[value#3])
The result should be the following:
(11111,[value#1])
(11112,[value#2])
(11113,[value#3])
(22221,[value#4])
(22222,[value#5])
(22223,[value#6])
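To spell out how we arrived at that expected result, here is a small Python
sketch (not Pig) that models the two range loads and the union. It is only an
illustration: scan() is a hypothetical helper, and it assumes that -gte and
-lte act as inclusive bounds on row keys compared as strings.

```python
# Row keys and 'data:value' cells from the example table above.
rows = {
    "11111": "1", "11112": "2", "11113": "3",
    "22221": "4", "22222": "5", "22223": "6",
}


def scan(table, gte, lte):
    """Hypothetical stand-in for a range load: keep rows with gte <= key <= lte.

    Assumption: keys are compared lexicographically and both bounds are inclusive.
    """
    return [(key, {"value": value})
            for key, value in sorted(table.items())
            if gte <= key <= lte]


load1 = scan(rows, "11110", "22220")  # models -gte 11110 -lte 22220
load2 = scan(rows, "22220", "33330")  # models -gte 22220 -lte 33330

# UNION simply combines the tuples of both inputs (no de-duplication).
# Pig does not guarantee the order; concatenation is used here for readability.
result = load1 + load2

for key, data in result:
    print(key, data)
```

Under those assumptions the two ranges do not overlap, so the union should
contain each of the six rows exactly once.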
If we dump load1 or load2 on its own, we see the rows we expect. After the
UNION, however, the rows from load1 appear twice and the rows from load2 are
missing.
Is this a known issue with Pig/HBaseStorage, or are we not using them as
intended?
If it is a usage problem, what is the proper way to load multiple sets of data
and union them?
Thanks in advance.
Eduardo.