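Calculation_data = LOAD ...
Actual_data = LOAD ...

-- GROUP ... ALL collapses each relation into one tuple holding a single bag.
cd_group = GROUP Calculation_data ALL;
ad_group = GROUP Actual_data ALL;

-- CROSS the two one-row relations so both bags end up in the same tuple.
crossed = CROSS ad_group, cd_group;
result = FOREACH crossed GENERATE udf.name(ad_group::Actual_data, cd_group::Calculation_data);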

Actual_data will be a bag of tuples (field1, field2) and Calculation_data
will be a bag of tuples (field).
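
On the Python side, a minimal sketch could look like the following. It assumes
Pig hands each bag to the function as its own argument, as a list of tuples.
The function is called name only to match udf.name in the script, and the
calculation itself (counting how many Calculation_data values fall below
field1) is a made-up placeholder for whatever you really want to compute.

```python
# Hypothetical UDF body; the "count values below field1" logic is a placeholder.
def name(actual_data, calculation_data):
    # actual_data: list of (field1, field2) tuples
    # calculation_data: list of one-element (field,) tuples
    calc_values = [t[0] for t in calculation_data]
    result = []
    for field1, field2 in actual_data:
        # Use the whole calculation bag for every tuple of actual data.
        below = sum(1 for v in calc_values if v < field1)
        result.append((field1, field2, below))
    return result
```

Registering the file with Pig and declaring an output schema are left out
here; without a schema, Pig will probably not know the type of what comes back.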

Maybe you can post your Python code as well?

2011/1/13 <deepak....@wipro.com>

> Hi Jonathan,
>
> Thanks for your response.
>
> How would the Pig statement look?
>
> Building upon my previous example,
>
> ------------
> Calculation_data = LOAD '/path/to/data' AS (field:int);
> Actual_data = LOAD '/path/to/data' AS (field1:int, field2:int);
>
> -- I want to do something like this:
> Result = FOREACH Actual_data GENERATE
> udf_file_name.fun_name(Calculation_data, field1, field2);
>
> -- But it's not legal.
> -------------
>
> There is an example in the UDFScriptingwithPython page on the Pig site,
> where an inner bag is being converted to an outer bag. But that doesn't
> seem to be very relevant here, except for the point that if I somehow pass
> an outer or inner bag into the UDF, I can work on it like any other Python
> list.
>
> Thanks,
> Deepak
>
> -----Original Message-----
> From: Jonathan Coveney [mailto:jcove...@gmail.com]
> Sent: Thursday, January 13, 2011 6:31 PM
> To: user@pig.apache.org
> Subject: Re: Pass an Outer Bag into UDF
>
> Whoops, look at me, I didn't realize you were passing it to Python. The same
> should work, just...in Python. The bags will be lists, and the input will be
> a tuple of those two bag lists.
>
> 2011/1/13 Jonathan Coveney <jcove...@gmail.com>
>
> > You absolutely can do this, I'm just not sure if you can do it using
> > the accumulator interface (I THINK you can, as I think it ratchets
> > only the first tuple input, and passes the entire second one, but am
> > not sure, someone else can weigh in). If you CAN do it with the
> > accumulator interface though, I highly recommend it, as it's more
> > memory efficient.
> >
> > Basically, you'll have something like this as your Java UDF:
> >
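> > import java.io.IOException;
> > import java.util.Iterator;
> > import org.apache.pig.EvalFunc;
> > import org.apache.pig.data.DataBag;
> > import org.apache.pig.data.Tuple;
> >
> > // Long is only a placeholder for <OutputType>; return whatever you compute.
> > public class doublebag extends EvalFunc<Long> {
> >   public Long exec(Tuple input) throws IOException {
> >     DataBag innerBag = (DataBag)input.get(0);
> >     DataBag outerBag = (DataBag)input.get(1);
> >     long pairs = 0;  // placeholder result: number of (inner, outer) pairs
> >     Iterator<Tuple> ibit = innerBag.iterator();
> >     while (ibit.hasNext()) {
> >        Tuple ibelem = ibit.next();
> >        Iterator<Tuple> obit = outerBag.iterator();
> >        while (obit.hasNext()) {
> >          Tuple obelem = obit.next();
> >          pairs++;  // do the real work on ibelem and obelem here
> >        }
> >     }
> >     return pairs;
> >   }
> > }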
> >
> > Obviously, you need to do things like checking for empty input, etc.;
> > this is just a very rough example of the code. The point is, if you
> > just implement exec, you'll simply have a bag as your first input and
> > a bag as the second.
> >
> > And then in your Pig script, if you want to run one entire bag against
> > another, you'd just do a GROUP thing ALL; and pass it that thing.
> >
> > Hope that explanation made sense; if it didn't, just ask again. It's
> > worth going over the bag->bag example in the UDF manual, and really,
> > you just have 2 bag inputs instead of one.
> >
> > 2011/1/13 <deepak....@wipro.com>
> >
> >> Hi,
> >>
> >> I wish to pass an outer bag into a Python UDF.
> >>
> >> Something like:
> >>
> >> Calculation_data = LOAD '/path/to/data' AS (field:int);
> >> Actual_data = LOAD '/path/to/data' AS (field1:int, field2:int);
> >>
> >>
> >> Calculation_data is not a very big bag. Maybe about 500 tuples in all
> >> - a single file.
> >> Actual_data is the real data source lying on HDFS.
> >>
> >> For each tuple in Actual_data, I wish to have the entire
> >> Calculation_data (whole bag) for some calculation that I wish to do.
> >>
> >> So, with every call to a UDF, I need to pass this bag along with
> >> tuple from Actual_data.
> >>
> >> Simply, is there any way of passing an outer bag into a Python UDF?
> >>
> >> Regards,
> >> Deepak
