The easy option would be to write your own UDF, which can catch corner
cases and so on.
But assuming your data strictly follows the format you described,
something like this might help (illustrative only!):
pets = LOAD 'pets.txt' USING PigStorage(';') AS (pet_id:chararray, pet_type:chararray, pet_name:chararray);
people = LOAD 'peoples.txt' USING PigStorage(';') AS (user:chararray, ids:chararray);
-- Split the comma-separated ids field.
people_t = FOREACH people GENERATE user, STRSPLIT(ids, ',');
-- STRSPLIT returns a tuple, not a bag, so convert it to a bag and flatten it.
people_reqd = FOREACH people_t GENERATE user, FLATTEN(TOBAG($1)) AS (user_pet_id);
-- $MY_PARALLEL is a parameter you supply when launching the script.
reqd_op = JOIN people_reqd BY user_pet_id, pets BY pet_id PARALLEL $MY_PARALLEL;
reqd_op should then contain what you need.
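The joined relation carries the columns of both inputs (user, user_pet_id, then the pets fields), so one more projection is needed to get the exact layout asked for in the question. A minimal sketch, assuming the aliases used above; final_op and the output path 'pets_with_owner' are hypothetical names chosen for illustration:

```pig
-- Sketch only: final_op and 'pets_with_owner' are hypothetical names.
-- After a JOIN, each field is qualified by the alias of the relation it
-- came from, so the :: operator is used to pick the columns to keep.
final_op = FOREACH reqd_op GENERATE
    pets::pet_id      AS pet_id,
    pets::pet_type    AS pet_type,
    pets::pet_name    AS pet_name,
    people_reqd::user AS user;

-- Write the rows with the same ';' delimiter as the input files.
STORE final_op INTO 'pets_with_owner' USING PigStorage(';');
```

Each stored row then has the form pet_id;pet_type;pet_name;user. While testing on a small input, DUMP final_op; can be used in place of the STORE.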
Regards,
Mridul
On Tuesday 10 May 2011 03:00 PM, Vincent wrote:
Hello dear Pig users,
I am loading a file with the following format:
$ cat peoples.txt
tom;1234,4567,6
anna;27894
The first field is a name; the second field is a concatenation of an
unknown number of pet ids.
I would like to JOIN this file with another one:
$ cat pets.txt
1234;dog;cocker
4567;mouse;usa
6;cat;persian
27894;cat;manx
Fields are the pet's id, the pet's type, and the pet's breed.
The goal is to get the following result:
1234;dog;cocker;tom
4567;mouse;usa;tom
6;cat;persian;tom
27894;cat;manx;anna
The problem is that I don't know how to convert a tuple of fields to lines,
i.e. to put the file peoples.txt into the following intermediate format:
tom,1234
tom,4567
tom,6
anna,27894
Thanks in advance for your help!
Vincent Hervieux