Hello,
I've been experimenting with Pig, using the Accumulo-Pig extension to read
and write data in an Accumulo table, and I've run into a problem doing a
join. I'm hoping someone on this list can shed some light on what might be
going on. I have an Accumulo table called people with the following
records loaded into it:
1 hobby: [] Home Theater
1 name: [] Rob
2 hobby: [] Tennis
2 name: [] Emma
3 hobby: [] Knitting
3 name: [] Liz
4 hobby: [] Robotics
4 name: [] Stephen
I've written a Pig script that performs two loads from the table. The
first loads all the records with the column family 'name' and the second
loads all the records with the column family 'hobby'. It then joins the
two sets of records on the 'id' field, which is the row ID in Accumulo.
However, the results I get from the join are not what I expected.
I expected:
(1,name,,,1360962196153,Rob,1,hobby,,,1360962338618,Home Theater)
(2,name,,,1360962196153,Emma,2,hobby,,,1360962338618,Tennis)
(3,name,,,1360962196153,Liz,3,hobby,,,1360962338618,Knitting)
(4,name,,,1360962196153,Stephen,4,hobby,,,1360962338618,Robotics)
What I got was:
(1,hobby,,,1360962338618,Home Theater,1,hobby,,,1360962338618,Home Theater)
(2,hobby,,,1360962338618,Tennis,2,hobby,,,1360962338618,Tennis)
(3,hobby,,,1360962338618,Knitting,3,hobby,,,1360962338618,Knitting)
(4,hobby,,,1360962338618,Robotics,4,hobby,,,1360962338618,Robotics)
I've included the script below for reference. Note that the script dumps
peopleRecords and hobbyRecords right after loading them, and each contains
the expected contents. peopleRecords contains:
(1,name,,,1360962196153,Rob)
(2,name,,,1360962196153,Emma)
(3,name,,,1360962196153,Liz)
(4,name,,,1360962196153,Stephen)
hobbyRecords contains:
(1,hobby,,,1360962338618,Home Theater)
(2,hobby,,,1360962338618,Tennis)
(3,hobby,,,1360962338618,Knitting)
(4,hobby,,,1360962338618,Robotics)
I also ran 'describe' on the joined relation, and it shows the correct
schema for the two joined record sets:
jnd: {peopleRecords::id: int,peopleRecords::cf: chararray,peopleRecords::cq: chararray,peopleRecords::cv: chararray,peopleRecords::ts: long,peopleRecords::name: chararray,hobbyRecords::id: int,hobbyRecords::cf: chararray,hobbyRecords::cq: chararray,hobbyRecords::cv: chararray,hobbyRecords::ts: long,hobbyRecords::activity: chararray}
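As a rough sketch of how the relation-qualified names in that schema could be used, the join output can be projected down to just the row ID, the name, and the hobby, which should make any mismatch between the two sides easier to spot. The relation name slim is hypothetical; the field names are the ones reported by 'describe'.

```pig
-- Sketch only: 'slim' is a hypothetical relation name.
-- Project the join down to the row ID, the name, and the hobby using
-- the relation-qualified field names that 'describe jnd' reports.
slim = foreach jnd generate peopleRecords::id as id,
                            peopleRecords::name as name,
                            hobbyRecords::activity as activity;
dump slim;
```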
I have tried this in both local mode and mapreduce mode, and both yield
similar results, although in local mode I usually get a record set that has
peopleRecords joined with itself instead of hobbyRecords joined with
itself.
Any ideas?
Thanks,
Rob
Script:
REGISTER /accumulo/jt6211-accumulo-pig-4c6cb82/lib/accumulo-core-1.4.0.jar
REGISTER /accumulo/jt6211-accumulo-pig-4c6cb82/lib/cloudtrace-1.4.0.jar
REGISTER /accumulo/jt6211-accumulo-pig-4c6cb82/lib/libthrift-0.6.1.jar
REGISTER /accumulo/jt6211-accumulo-pig-4c6cb82/lib/zookeeper-3.3.1.jar
REGISTER /accumulo/jt6211-accumulo-pig-4c6cb82/target/accumulo-pig-1.4.0.jar
peopleRecords = load
'accumulo://people?instance=holmes&user=XXXXX&password=XXXXXX&zookeepers=d0512b08:39704&columns=name'
using org.apache.accumulo.pig.AccumuloStorage() as (id:int,
cf:chararray, cq:chararray, cv:chararray, ts:long, name:chararray);
dump peopleRecords;
hobbyRecords = load
'accumulo://people?instance=holmes&user=XXXXX&password=XXXXXX&zookeepers=d0512b08:39704&columns=hobby'
using org.apache.accumulo.pig.AccumuloStorage() as (id:int,
cf:chararray, cq:chararray, cv:chararray, ts:long, activity:chararray);
dump hobbyRecords;
jnd = join peopleRecords by id, hobbyRecords by id;
dump jnd;
describe jnd;
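
One way to narrow down whether the problem lies in the join itself or in how the two AccumuloStorage loads interact might be to write each relation to a file first and join the re-loaded copies. The following is an untested sketch, not something confirmed to resolve the issue: the /tmp paths and the peopleCopy, hobbyCopy, and jndCopy names are hypothetical, and it uses Pig's built-in PigStorage with a tab delimiter.

```pig
-- Diagnostic sketch; the /tmp paths and the *Copy names are hypothetical.
-- Part 1: materialize each relation separately.
store peopleRecords into '/tmp/people_name' using PigStorage('\t');
store hobbyRecords into '/tmp/people_hobby' using PigStorage('\t');

-- Part 2: re-load the stored copies and join them.
-- If run as one script, the stores must finish before these loads;
-- running the two parts as separate scripts avoids relying on that.
peopleCopy = load '/tmp/people_name' using PigStorage('\t') as (id:int,
    cf:chararray, cq:chararray, cv:chararray, ts:long, name:chararray);
hobbyCopy = load '/tmp/people_hobby' using PigStorage('\t') as (id:int,
    cf:chararray, cq:chararray, cv:chararray, ts:long, activity:chararray);
jndCopy = join peopleCopy by id, hobbyCopy by id;
dump jndCopy;
```

If the join of the re-loaded copies comes out as expected, that would suggest the issue is in how the two loads are handled rather than in the join itself.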