Hi Malcolm,

Arrays are converted to tuples, and flatten should work on them directly. I don't think you need to worry about the delimiter (assuming Hive knows how to deserialize it). Btw, does RCFile require a delimiter to store arrays? I am not sure about that.
Thanks,
Aniket

On Wed, Apr 11, 2012 at 8:14 PM, Norbert Burger <[email protected]> wrote:

> A little wonky, but try wrapping the flattened tuple elements in a bag, and
> then re-flattening that:
>
> A = LOAD 'test.txt' USING PigStorage(',') AS
>     (C_SUB_ID:chararray,seg_ids:chararray);
> B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
> C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));
>
> Only flattened bags generate the cols -> rows transformation that you're
> trying to make. Flattened tuples, on the other hand, simply explode the
> tuple into its composite elements, but without creating the multiple rows
> ("cross product") in your relation. A custom UDF would be another option
> here.
>
> Norbert
>
> On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye <[email protected]> wrote:
>
> > Hi Norbert,
> > I don't seem to be getting what I'm after. If my data looks like
> > this
> >
> > 1133957209,61:0:1
> > 4524524233,21:0
> >
> > I want to produce
> >
> > 1133957209,61
> > 1133957209,0
> > 1133957209,1
> > 4524524233,21
> > 4524524233,0
> >
> > I changed the LOAD statement to
> >
> > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > string,seg_ids
> > array');
> > opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > I don't seem to be getting the cross product, just something like the
> > following
> >
> > 1133957209,61,0,1
> > 4524524233,21,0
> >
> > Any ideas?
> >
> > Thanks
> >
> > Malc
> >
> > -----Original Message-----
> > From: Norbert Burger [mailto:[email protected]]
> > Sent: 06 April 2012 16:01
> > To: [email protected]
> > Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile
> >
> > Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to
> > tokenize a chararray on some delimiter. So the following should work:
> >
> > opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > Norbert
> >
> > On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> > <[email protected]> wrote:
> >
> > > Hi,
> > > I'm storing data into a partitioned table using Hive in RCFile
> > > format, but I want to use Pig to do the aggregation of that data.
> > >
> > > In my array <string> in Hive, I have colon delimited data, E.g.
> > >
> > > :0:12:21:99:
> > >
> > > With the lateral view and explode functions in Hive, I can output each
> > > value as a separate row.
> > >
> > > In Pig, I think I need to use flatten, but it just outputs the array
> > > as a single field, and I can't see where to specify that the delimiter
> > > is the delimiter/value separator
> > >
> > > register /opt/pig/trunk/bin/piggybank.jar
> > > mt = LOAD
> > > '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > > string,seg_ids
> > > array<string>');
> > > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
> > > dump opt;
> > >
> > > Thanks
> > >
> > > Malc

--
"...:::Aniket:::... Quetzalco@tl"
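
For anyone reading this thread later, the tuple-versus-bag distinction Norbert describes can be modelled with a small Python sketch. It is only a toy model of the semantics, not Pig's implementation: the helper names (strsplit, flatten_tuple, flatten_bag) are made up for illustration, the split is a literal one rather than the regex split that STRSPLIT performs, and a Python list stands in for an (unordered) Pig bag. The input rows are Malcolm's sample data.

```python
# Toy model of Pig's FLATTEN semantics; an illustration, not Pig's code.

def strsplit(value, delimiter):
    """Stand-in for STRSPLIT: return a tuple of the delimited pieces."""
    return tuple(value.split(delimiter))

def flatten_tuple(key, fields):
    """Flattening a tuple widens the row: one output row, more columns."""
    return [(key, *fields)]

def flatten_bag(key, bag):
    """Flattening a bag crosses it with the key: one output row per element."""
    return [(key, item) for item in bag]

# Malcolm's sample input: C_SUB_ID, colon-delimited seg_ids
rows = [("1133957209", "61:0:1"), ("4524524233", "21:0")]

# Relation B in Norbert's script: FLATTEN(STRSPLIT(...)) gives wide rows
wide_rows = [
    out
    for key, seg_ids in rows
    for out in flatten_tuple(key, strsplit(seg_ids, ":"))
]

# Relation C: wrap the trailing columns in a bag ($1..) and flatten again
long_rows = [
    out
    for key, *segs in wide_rows
    for out in flatten_bag(key, segs)
]

for row in long_rows:
    print(",".join(row))
```

In this model, wide_rows matches the output Malcolm reported (1133957209,61,0,1 and 4524524233,21,0), while long_rows is the one-row-per-segment result he was after.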
