Calculation_data = LOAD ...
Actual_data = LOAD ...
cd_group = GROUP Calculation_data ALL;
ad_group = GROUP Actual_data ALL;
result = FOREACH ad_group GENERATE udf.name(Actual_data, Calculation_data);

Actual_data will be a bag of tuples (field1, field2) and Calculation_data will be a bag of tuples (field). Maybe you can post your Python code as well?

2011/1/13 <deepak....@wipro.com>

> Hi Jonathan,
>
> Thanks for your response.
>
> How would the Pig statement look?
>
> Building upon my previous example,
>
> ------------
> Calculation_data = LOAD '/path/to/data' AS (field:int);
> Actual_data = LOAD '/path/to/data' AS (field1:int, field2:int);
>
> -- I want to do something like this:
> Result = FOREACH Actual_data GENERATE
>     udf_file_name.fun_name(Calculation_data, field1, field2);
>
> -- But it's not legal.
> -------------
>
> There is an example on the UDFScriptingwithPython page on the Pig site,
> where an inner bag is being converted to an outer bag. But that doesn't
> seem to be very relevant here, except for the point that if I somehow pass
> an outer or inner bag into the UDF, I can work on it like any other Python
> list.
>
> Thanks,
> Deepak
>
> -----Original Message-----
> From: Jonathan Coveney [mailto:jcove...@gmail.com]
> Sent: Thursday, January 13, 2011 6:31 PM
> To: user@pig.apache.org
> Subject: Re: Pass an Outer Bag into UDF
>
> Whoops, look at me, I didn't realize you were passing it to Python. The same
> should work, just...in Python. The bags will be lists, the input a tuple of
> those two bag lists.
>
> 2011/1/13 Jonathan Coveney <jcove...@gmail.com>
>
> > You absolutely can do this, I'm just not sure if you can do it using
> > the accumulator interface (I THINK you can, as I think it ratchets
> > only the first tuple input, and passes the entire second one, but am
> > not sure, someone else can weigh in). If you CAN do it with the
> > accumulator interface though, I highly recommend it, as it's more
> > memory efficient.
> >
> > Basically, you'll have something like this as your UDF (Java):
> >
> > import java.io.IOException;
> > import java.util.Iterator;
> > import org.apache.pig.EvalFunc;
> > import org.apache.pig.data.DataBag;
> > import org.apache.pig.data.Tuple;
> >
> > // OutputType is a placeholder for whatever your UDF returns.
> > public class doublebag extends EvalFunc<OutputType> {
> >   public OutputType exec(Tuple input) throws IOException {
> >     DataBag innerBag = (DataBag) input.get(0);  // first bag argument
> >     DataBag outerBag = (DataBag) input.get(1);  // second bag argument
> >     Iterator<Tuple> ibit = innerBag.iterator();
> >     while (ibit.hasNext()) {
> >       Tuple ibelem = ibit.next();
> >       // Restart the second bag's iterator for every tuple of the first.
> >       Iterator<Tuple> obit = outerBag.iterator();
> >       while (obit.hasNext()) {
> >         Tuple obelem = obit.next();
> >         // work on the (ibelem, obelem) pair here
> >       }
> >     }
> >     return null;  // replace with your actual result
> >   }
> > }
> >
> > Obviously, you need to do things like checking for empty input, etc.;
> > this is just a rough example of the code. The point is, if
> > you just implement exec, you'll simply have a bag as your first input
> > and a bag as the second.
> >
> > And then in your Pig script, if you want to run one entire bag against
> > another, you'd just do a GROUP thing ALL; and pass it that thing.
> >
> > Hope that explanation made sense; if it didn't, just ask again. It's
> > worth going over the bag->bag example in the UDF manual, and really,
> > you just have 2 bag inputs instead of one.
> >
> > 2011/1/13 <deepak....@wipro.com>
> >
> >> Hi,
> >>
> >> I wish to pass an outer bag into a Python UDF.
> >>
> >> Something like:
> >>
> >> Calculation_data = LOAD '/path/to/data' AS (field:int);
> >> Actual_data = LOAD '/path/to/data' AS (field1:int, field2:int);
> >>
> >> Calculation_data is not a very big bag. Maybe about 500 tuples in all
> >> - a single file.
> >> Actual_data is the real data source lying on HDFS.
> >>
> >> For each tuple in Actual_data, I wish to have the entire
> >> Calculation_data (whole bag) for some calculation that I wish to do.
> >>
> >> So, with every call to a UDF, I need to pass this bag along with a
> >> tuple from Actual_data.
> >>
> >> Simply, is there any way of passing an outer bag into a Python UDF?
> >>
> >> Regards,
> >> Deepak
> >> Please do not print this email unless it is absolutely necessary.
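
To make the Python side concrete, here is a rough sketch of what the function behind the udf.name(...) call in the script at the top of this message might look like. It assumes the file is registered through Pig's Jython support, that each bag arrives as a list of tuples (as described above), and that Pig makes the outputSchema decorator available; the fallback definition is only there so the file also runs under plain Python for testing. The schema string and the calculation itself are placeholders, not Deepak's actual logic.

```python
# Hypothetical UDF file, e.g. udf.py, registered from the Pig script.
try:
    outputSchema  # assumed to be supplied by Pig when it loads the file
except NameError:
    # Fallback no-op decorator so the file can be exercised outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("results:bag{t:tuple(field1:int, field2:int, total:int)}")
def name(actual_data, calculation_data):
    """Run every (field1, field2) tuple against the whole calculation bag."""
    # Treat a missing (null) bag as an empty one.
    actual_data = actual_data or []
    calculation_data = calculation_data or []

    # Placeholder calculation: sum the single field of the small bag once.
    calc_total = sum(row[0] for row in calculation_data)

    results = []
    for field1, field2 in actual_data:
        # Each output tuple carries the inputs plus the placeholder result.
        results.append((field1, field2, field1 + field2 + calc_total))
    return results
```

Called with [(1, 2), (3, 4)] and [(10,), (20,)], this returns [(1, 2, 33), (3, 4, 37)]. Because the whole of Actual_data is handed over as one bag, this only suits cases where that bag fits in memory.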