Unfortunately, the nested FOREACH plan does not support STREAM right now. Here are some choices:
1. You can customize the input/output of your Perl script. Check http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#DEFINE and search for "About Input and Output".
2. Use a UDF instead of STREAM:

result = FOREACH awesome_items_grouped_by_source GENERATE
    your_udf(awesome_items);

Inside the UDF, you can iterate over the bag and process individual fields. You can also write your UDF in Python (but not Perl). Check http://wiki.apache.org/pig/UDFsUsingScriptingLanguages
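
To make the UDF route concrete, here is a rough sketch of what a Python (Jython) UDF for your concatenation could look like. Treat it as an illustration under assumptions, not tested against your data: the file name concat_udf.py, the function name concat_by_source, the output field names and the space separator are all made up for the example, and it assumes the bag arrives as a list of (source, str1, str2) tuples and that your Pig version supports scripting UDFs.

```python
# Sketch of a Jython UDF file (hypothetical name: concat_udf.py).
# Pig's Jython engine is expected to supply the outputSchema decorator;
# the fallback below only exists so the file also runs under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Assumption: the original question does not say how the strings should be
# joined, so a single space is used here. Change it as needed.
SEPARATOR = " "


@outputSchema("t:(source:chararray, all_str1:chararray, all_str2:chararray)")
def concat_by_source(bag):
    """Collapse a bag of (source, str1, str2) tuples into a single tuple."""
    if not bag:
        return None
    source = None
    str1_parts = []
    str2_parts = []
    for item in bag:
        if source is None:
            # All tuples in one group share the same source.
            source = item[0]
        # Assumption: null fields are treated as empty strings.
        str1_parts.append(item[1] or "")
        str2_parts.append(item[2] or "")
    return (source, SEPARATOR.join(str1_parts), SEPARATOR.join(str2_parts))
```

Because the UDF receives real tuples rather than a printed line, commas and parentheses inside the strings need no escaping. Assuming the file is registered under the (hypothetical) namespace myfuncs, the call would look roughly like:

REGISTER 'concat_udf.py' USING jython AS myfuncs;
result = FOREACH awesome_items_grouped_by_source GENERATE
    myfuncs.concat_by_source(awesome_items);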

Daniel

Dragos Munteanu wrote:
I'm either unhappy with streaming bags, or I'm doing something wrong.
I have a relation, and I want to apply a user-defined operation on its
elements, conditioned on the field "source".
Let's say that my relation "awesome_items" has schema (source:chararray,
str1:chararray, str2:chararray); and my operation is to take all elements
that have the same source and produce a single element (source,
concatenation_of_all_str1, concatenation_of_all_str2).
One way to do it is:
awesome_items_grouped_by_source = GROUP awesome_items BY source;
result = STREAM awesome_items_grouped_by_source THROUGH myScript.pl;
So myScript.pl will receive lines of input that look like:
(source1 TAB {(source1,first str1,first str2),(source1,second str1,second str2)}
and will have to extract the individual elements from the bag (and then do
the concatenations for all elements in the same bag).
My biggest problem is the format of the bag: comma-separated elements, each
with comma-separated fields; this is hard to parse correctly, given that the
strings can themselves have commas and parentheses (what if the first
element of the bag has, as value of str1, "first,str1"? Then the bag element
will be (source1,first,str1,first str2).)
Another problem is that the bags can get very large; and passing a bag with
thousands of items to Perl on a single line might create all sorts of other
problems.
Is there a way to have the bag send its contents into the perl script one
tuple at a time? Or somehow generate a tab-delimited format?
I would like to do something like:
result = FOREACH awesome_items_grouped_by_source {
  bag_contents = FLATTEN(awesome_items);
  result = STREAM bag_contents THROUGH myScript.pl
  GENERATE result;
}
But that doesn't work for several reasons.
Of course, I can get around the first problem by doing stuff like "escaping"
all the commas and parentheses from my strings, thus eliminating
ambiguities. But in my real situation, that's hard to do. Also, this
wouldn't address the second problem.

Thanks
Dragos