Hi there.
I am having trouble using Pig with a Python UDF
and was wondering if someone could shed some light on what I am doing
wrong. Pig seems to have trouble handling a bag of
tuples returned by a Python UDF: it throws the following
ClassCastException:
java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot
be cast to org.apache.pig.data.DataBag
Below is the simplest code I could come up with that
reproduces what I am trying to do.
A = LOAD 'data' AS (url:chararray,outlink:chararray);
-- (www.ccc.com,www.hjk.com)
-- (www.ddd.com,www.xyz.org)
-- (www.aaa.com,www.cvn.org)
-- (www.www.com,www.kpt.net)
-- (www.www.com,www.xyz.org)
-- (www.ddd.com,www.xyz.org)
B = GROUP A BY url;
-- (www.aaa.com,{(www.aaa.com,www.cvn.org)})
-- (www.ccc.com,{(www.ccc.com,www.hjk.com)})
-- (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
-- (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
C = foreach B generate group, COUNT(A);
dump C;
-- (www.aaa.com,1)
-- (www.ccc.com,1)
-- (www.ddd.com,2)
-- (www.www.com,2)
-- OK, fine up to here. Let's try with a UDF:
register 'my_udf.py' using jython as my_udf;
-- Contents of my_udf.py:
-- @outputSchema("res:{t:(value:chararray)}")
-- def test_list_of_items():
--     return tuple([i for i in range(5)])
H = foreach B generate group, my_udf.test_list_of_items();
DUMP H;
-- (www.aaa.com,(0,1,2,3,4))
-- (www.ccc.com,(0,1,2,3,4))
-- (www.ddd.com,(0,1,2,3,4))
-- (www.www.com,(0,1,2,3,4))
DESCRIBE H;
-- H: {group: chararray,res: {t: (value: chararray)}}
I = FOREACH H generate group, COUNT(res);
DUMP I;
-- POWWWW!
-- org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
--   at org.apache.pig.builtin.COUNT.exec(COUNT.java:74)
--   (...)
-- Caused by: java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag
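For comparison, here is a minimal sketch of the two return shapes involved. It assumes that Pig's Jython bridge turns a plain Python tuple into a Pig tuple and a Python list of tuples into a bag. That mapping is an assumption here and has not been verified against this setup. The name test_bag_of_items is hypothetical, and the fallback outputSchema below is only a stand-in so the file also runs outside Pig.

```python
# Stand-in for the decorator that Pig provides when it loads a Jython UDF;
# it is defined here only so the sketch also runs as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("res:{t:(value:chararray)}")
def test_list_of_items():
    # Version from the post: returns one flat Python tuple.
    return tuple([i for i in range(5)])


@outputSchema("res:{t:(value:chararray)}")
def test_bag_of_items():
    # Hypothetical variant: a list of one-field tuples of strings, which is
    # the shape the declared schema describes (a bag of (value:chararray)).
    return [(str(i),) for i in range(5)]
```

If the assumed mapping holds, the first function hands COUNT a tuple where the schema promises a bag, which would be consistent with the ClassCastException above.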
Background: most of the data I work with is generated by
Hadoop Streaming apps and consists of tab-separated pairs of
JSON-encoded data. If I am not mistaken, this cannot be parsed
directly by JsonLoader or ElephantBird, so I wrote a
Python UDF to decode the JSON. However, I was unable to use it as
expected in Pig code: I kept getting strange errors even though my
data had the same schema as the data in some documentation
examples and I was doing exactly the same kind of manipulations.
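To make the input format concrete, the decoding step looks roughly like the sketch below. The function name decode_pair and the sample record are made up for illustration, and the code assumes that the json module is available in the Jython version bundled with Pig and that each record contains exactly one separating tab.

```python
import json


def decode_pair(line):
    """Split one 'key<TAB>value' record and JSON-decode both halves."""
    # Split on the first tab only; assumes the JSON itself was written
    # without raw tab characters between tokens.
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# Hypothetical record in the tab-separated JSON format described above.
sample = '"www.aaa.com"\t{"outlinks": ["www.cvn.org"]}\n'
key, value = decode_pair(sample)
```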
Cheers.
Tiago Alves Macambira