I am seeing what looks like a bug in Pig's behavior. I cannot share the
original data and script, so I created a sample dataset that reproduces the
problem.
This is my sample input file:
Data
John|Gary|42
Pig Script
data = LOAD 'data' USING PigStorage('|') AS (parent:chararray, child:chararray,
edge_id:chararray);
data1 = FOREACH data GENERATE parent AS node1, child AS node2, edge_id;
data2 = FOREACH data GENERATE child AS node1, parent AS node2, edge_id;
data3 = UNION data1, data2;
data4 = FOREACH data3 GENERATE node1, node2;
DESCRIBE data4;
$ pig -x local bug.pig
2014-02-10 13:55:31,201 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0-cdh3u4a (rexported) compiled Sep 04 2012, 14:03:46
2014-02-10 13:55:31,201 [main] INFO org.apache.pig.Main - Logging error messages to: /x/home/abc/pig_1392069331197.log
2014-02-10 13:55:31,452 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
data4: {node2: chararray,node2: chararray}
I expected the schema of data4 to contain node1 and node2, but DESCRIBE shows
node2 twice. Can anyone tell me what I am doing wrong here?
Thanks,
Ravi.
(Paypal Data Scientist)