Hi there.

I am having some trouble using Pig with a Python UDF, and I was
wondering if someone could shed some light on what I am doing
wrong. It seems that Pig has trouble handling a bag of
tuples returned by a Python UDF, as I am getting the following
ClassCastException:

java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot
be cast to org.apache.pig.data.DataBag

Below I have pasted the simplest code I could come up with that
illustrates what I am trying to do.


A = LOAD 'data' AS (url:chararray,outlink:chararray);
-- (www.ccc.com,www.hjk.com)
-- (www.ddd.com,www.xyz.org)
-- (www.aaa.com,www.cvn.org)
-- (www.www.com,www.kpt.net)
-- (www.www.com,www.xyz.org)
-- (www.ddd.com,www.xyz.org)
B = GROUP A BY url;
-- (www.aaa.com,{(www.aaa.com,www.cvn.org)})
-- (www.ccc.com,{(www.ccc.com,www.hjk.com)})
-- (www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
-- (www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
C = foreach B generate group, COUNT(A);
dump C;
-- (www.aaa.com,1)
-- (www.ccc.com,1)
-- (www.ddd.com,2)
-- (www.www.com,2)

-- OK, fine up to here. Let's try with a UDF:

register 'my_udf.py' using jython as my_udf;
-- Contents of my_udf.py:
-- @outputSchema("res:{t:(value:chararray)}")
-- def test_list_of_items():
--     return tuple([i for i in range(5)])

H = foreach B generate group, my_udf.test_list_of_items();
DUMP H;
-- (www.aaa.com,(0,1,2,3,4))
-- (www.ccc.com,(0,1,2,3,4))
-- (www.ddd.com,(0,1,2,3,4))
-- (www.www.com,(0,1,2,3,4))
DESCRIBE H;
-- H: {group: chararray,res: {t: (value: chararray)}}
I = FOREACH H generate group, COUNT(res);
DUMP I;
-- POWWWW!
-- org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
--      at org.apache.pig.builtin.COUNT.exec(COUNT.java:74)
-- (...)
-- Caused by: java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag
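
For comparison, the sketch below puts the UDF as posted next to a
hypothetical variant whose return value has the shape the declared
schema describes. In Pig's Jython type mapping a bag corresponds to a
Python list of tuples, while a Python tuple maps to a Pig tuple. The
name test_bag_of_items and the fallback outputSchema decorator are
assumptions added for illustration (Pig normally provides outputSchema
itself), and whether this change removes the ClassCastException has
not been verified here.

```python
# Stand-in for the outputSchema decorator that Pig injects when it loads a
# Jython UDF; defined here only so the sketch also runs under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("res:{t:(value:chararray)}")
def test_list_of_items():
    # As posted: one flat Python tuple, which maps to a Pig tuple, not a bag.
    return tuple([i for i in range(5)])


@outputSchema("res:{t:(value:chararray)}")
def test_bag_of_items():
    # Hypothetical variant: a list of one-field tuples, the shape a bag
    # schema describes. Values are strings to match value:chararray.
    return [(str(i),) for i in range(5)]
```

The first function yields a single five-field tuple, which matches the
(0,1,2,3,4) rows in the DUMP H output above; the second yields five
one-field tuples wrapped in a list.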




Background: Most of the data I have to work with is generated by
Hadoop Streaming apps and consists of tab-separated pairs of
JSON-encoded data. If I am not mistaken, this is not something that
could be parsed by JsonLoader or ElephantBird directly, so I wrote a
Python UDF to decode the JSON data, but I was unable to use it as
expected in Pig code. I kept getting strange errors even though my
data had the same schema as the data used in some documentation
examples and I was performing exactly the same kind of operations.
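
To make the background concrete, here is a minimal sketch of the kind
of decoding described above. It is not the actual UDF: the function
names decode_pair and outlinks_as_bag and the sample record layout (a
JSON string, a tab, then a JSON array of outlinks) are hypothetical,
and it assumes a Python runtime that ships the json module.

```python
import json


def decode_pair(line):
    """Split one tab-separated record and JSON-decode both halves."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


def outlinks_as_bag(line):
    """Return the decoded outlinks as a list of one-field tuples."""
    _url, outlinks = decode_pair(line)
    return [(link,) for link in outlinks]
```

A record such as '"www.ddd.com"<TAB>["www.xyz.org", "www.kpt.net"]'
would decode to the URL string plus a list of outlinks, and
outlinks_as_bag would wrap each outlink in its own tuple.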

Cheers.

Tiago Alves Macambira
