I am seeing what looks like a bug in Pig's behavior. I cannot share the 
original data and script, so I created a sample dataset and script that 
reproduce the problem.

This is my sample input file

Data

John|Gary|42

Pig Script

data = LOAD 'data' USING PigStorage('|') AS (parent:chararray, child:chararray, 
edge_id:chararray);

data1 = FOREACH data GENERATE parent AS node1, child AS node2, edge_id;

data2 = FOREACH data GENERATE child AS node1, parent AS node2, edge_id;

data3 = UNION data1, data2;

data4 = FOREACH data3 GENERATE node1, node2;

DESCRIBE data4;

$ pig -x local bug.pig

2014-02-10 13:55:31,201 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-cdh3u4a (rexported) compiled Sep 04 2012, 14:03:46
2014-02-10 13:55:31,201 [main] INFO  org.apache.pig.Main - Logging error messages to: /x/home/abc/pig_1392069331197.log
2014-02-10 13:55:31,452 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
data4: {node2: chararray,node2: chararray}


I expect the schema of data4 to contain node1 and node2, but DESCRIBE reports 
node2 twice. Can anyone tell me what I am doing wrong here?

Thanks,
Ravi.

(PayPal Data Scientist)
