Thanks Yong and Mridul,

I was able to do the trick like this:

A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray, pets_ids : chararray);

B = FOREACH A GENERATE name, TOKENIZE(REPLACE(pets_ids, ',', ' ')) AS products_bag;
DUMP B;
DESCRIBE B;

C = FOREACH B GENERATE name, FLATTEN(products_bag) AS (user_pet_id : chararray);
DUMP C;
DESCRIBE C;

D = LOAD 'pets.txt' USING PigStorage(';') AS (id : chararray, type : chararray, race : chararray);
DUMP D;
DESCRIBE D;

reqd_op = JOIN C BY user_pet_id, D BY id PARALLEL 5;
DUMP reqd_op;


For your information, and maybe to help others: TOBAG(STRSPLIT(...)) creates
a bag containing only one tuple that holds all the fields, so FLATTEN yields a
single row per name instead of one row per pet id. That is why it was not
working.
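
To make the difference concrete, here is a small Python model of the two data
shapes. It is not Pig itself: strsplit, tokenize and flatten_bag are
hypothetical stand-ins that only mimic the outputs shown in this thread, with a
Python tuple playing the role of a Pig tuple and a list playing the role of a bag.

```python
# Toy model of the data shapes only; these helpers are assumptions,
# not the real Pig built-ins.

def strsplit(ids, sep=","):
    """Model of STRSPLIT: returns ONE tuple holding every field."""
    return tuple(ids.split(sep))

def tokenize(text):
    """Model of TOKENIZE: returns a bag with one single-field tuple per token."""
    return [(token,) for token in text.split()]

def flatten_bag(name, bag):
    """Model of FLATTEN on a bag: one output row per tuple in the bag."""
    return [(name,) + item for item in bag]

people = [("tom", "1234,4567,6"), ("anna", "27894")]

# TOBAG(STRSPLIT(...)): the bag holds a single item, the whole split tuple,
# so flattening gives one row per person and the key is still a tuple.
rows_tobag = []
for name, ids in people:
    bag = [(strsplit(ids),)]
    rows_tobag.extend(flatten_bag(name, bag))

# TOKENIZE(REPLACE(...)): the bag holds one item per id,
# so flattening gives one row per (name, id) pair.
rows_tokenize = []
for name, ids in people:
    bag = tokenize(ids.replace(",", " "))
    rows_tokenize.extend(flatten_bag(name, bag))

print(rows_tobag)
print(rows_tokenize)
```

In this model the first list still carries a tuple in the key position, which
matches the "incompatible types" JOIN error, while the second list has plain
string keys.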

So you can do either:

*if you know the number of columns :*

B = FOREACH A GENERATE name, FLATTEN(STRSPLIT(pets_ids, ',')) AS (id1 : chararray, id2 : chararray, id3 : chararray);

C = FOREACH B GENERATE name, FLATTEN(TOBAG(id1,id2,id3));
DUMP C;
(tom,1234)
(tom,4567)
(tom,6)
(anna,27894)
(anna,)
(anna,)
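
The empty rows for anna come from the fixed schema: she has one id, but three
columns are declared. A rough Python sketch of that padding follows (split_fixed
is a hypothetical helper, and None stands in for a Pig null; this models the
output above rather than Pig's implementation):

```python
# Toy model of a fixed three-column split; not the real Pig built-ins.

def split_fixed(ids, width=3, sep=","):
    """Split ids and pad with None up to the declared number of columns."""
    parts = ids.split(sep)[:width]
    return tuple(parts + [None] * (width - len(parts)))

people = [("tom", "1234,4567,6"), ("anna", "27894")]

# One row per declared column, whether or not a value exists for it.
rows = [(name, pet_id) for name, ids in people for pet_id in split_fixed(ids)]

# Dropping the padded entries leaves only the real (name, id) pairs.
non_null_rows = [row for row in rows if row[1] is not None]

print(rows)
print(non_null_rows)
```

The padded rows would never match a pet id in the JOIN, but they are still
carried through the pipeline unless they are filtered out.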

*if you don't know the number of columns :*

B = FOREACH A GENERATE name, TOKENIZE(REPLACE(pets_ids, ',', ' ')) AS products_bag;

C = FOREACH B GENERATE name, FLATTEN(products_bag) AS (user_pet_id : chararray);
DUMP C;
(tom,1234)
(tom,4567)
(tom,6)
(anna,27894)

No nulls here.

I hope this helps others.

Cheers


Vincent


On Tue, May 10, 2011 at 4:01 PM, Vincent <[email protected]> wrote:

> Hi Yong, Hi Mridul,
>
> I've changed everything to chararray:
>
>
> A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray, pets_ids
> : chararray);
>
> B = foreach A GENERATE name, STRSPLIT(pets_ids, ',') AS pets_ids_separated;
> DUMP B;
> DESCRIBE B;
>
> C = FOREACH B GENERATE name, FLATTEN(TOBAG(pets_ids_separated)) AS (id);
> DUMP C;
> DESCRIBE C;
>
> D = LOAD 'pets.txt' USING PigStorage(';') AS (id : chararray, type :
> chararray, race: chararray);
> DUMP D;
> DESCRIBE D;
>
> reqd_op = JOIN C BY id, D BY id PARALLEL 5;
> DUMP reqd_op;
>
> But I still have the error:
> 2011-05-10 15:59:42,472 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1107: Cannot merge join keys, incompatible types
> Details at logfile: /local/tmp/test/pig_expand/pig_1305028757759.log
>
> DUMP C is
> (tom,(1234,4567,6))
> (anna,(27894))
>
> Description is C: {name: chararray,id: (null)}
>
> How can a JOIN operation support a tuple of chararray on the left-hand side
> and a chararray on the right-hand side?
>
> BR
>
> Vincent
>
>
>
>
> On Tue, May 10, 2011 at 3:43 PM, 勇胡 <[email protected]> wrote:
>
>> You can see that the types of the join keys are different. One is chararray,
>> the other is int. You have to change them to the same type.
>>
>> Yong
>>
>> 2011/5/10 Vincent <[email protected]>
>>
>> > Following your advice, I wrote the following:
>> >
>> > A = LOAD 'peoples.txt' USING PigStorage(';') AS (name : chararray,
>> > pets_ids : chararray);
>> >
>> > B = foreach A GENERATE name, STRSPLIT(pets_ids, ',') AS
>> > pets_ids_separated;
>> > DUMP B;
>> > DESCRIBE B;
>> >
>> > C = FOREACH B GENERATE name, FLATTEN(TOBAG(pets_ids_separated)) AS
>> > user_pet_id;
>> > DUMP C;
>> > DESCRIBE C;
>> >
>> > D = LOAD 'pets.txt' USING PigStorage(';') AS (id : int, type : chararray,
>> > race: chararray);
>> >
>> >
>> > reqd_op = JOIN C BY user_pet_id, D BY id PARALLEL 5;
>> > DUMP reqd_op;
>> >
>> > But I have the following error:
>> > 2011-05-10 15:30:04,036 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> > ERROR 1107: Cannot merge join keys, incompatible types
>> > Details at logfile: /local/tmp/test/pig_expand/pig_1305026987213.log
>> >
>> > Any idea what is going wrong here?
>> >
>> > Best Regards
>> >
>> > Vincent
>> >
>> >
>> >
>> > On Tue, May 10, 2011 at 3:04 PM, Mridul Muralidharan
>> > <[email protected]> wrote:
>> >
>> > >
>> > > I am not sure I follow your query related to PARALLEL.
>> > > The value for parallel is a static value.
>> > >
>> > > I was using $MY_PARALLEL as a placeholder to specify what sort of
>> > > parallelism you need.
>> > >
>> > > Typically you will have a default value in the script
>> > >
>> > > %default MY_PARALLEL '10'
>> > >
>> > > And override it, when required, using command line pig -param
>> > > MY_PARALLEL=50 ...
>> > >
>> > >
>> > >
>> > > Regards,
>> > > Mridul
>> > >
>> > >
>> > > On Tuesday 10 May 2011 04:26 PM, Vincent wrote:
>> > >
>> > >> Thanks Mridul for your quick answer!
>> > >>
>> > >> According to the documentation, PARALLEL sets the number of reduce
>> > >> tasks. So how can I make it take a UDF instead? Are there any examples
>> > >> of such functions in the SVN/pig0.8 package?
>> > >>
>> > >> Best Regards
>> > >>
>> > >> Vincent
>> > >>
>> > >> On Tue, May 10, 2011 at 2:02 PM, Mridul Muralidharan
>> > >> <[email protected]> wrote:
>> > >>
>> > >>
>> > >>    Easy option would be to write your own udf which can catch corner
>> > >>    cases, etc  ..
>> > >>    But assuming your data strictly follows what you mentioned,
>> > >>    something like this might help (illustrative only !) :
>> > >>
>> > >>    pets = load 'pets.txt'  USING PigStorage(';') AS
>> > >>    (pet_id:chararray, pet_type:chararray, pet_name:chararray);
>> > >>
>> > >>    people = load 'peoples.txt'  USING PigStorage(';') AS
>> > >>    (user:chararray, ids:chararray);
>> > >>    people_t = FOREACH people GENERATE user, STRSPLIT(ids, ',');
>> > >>    -- STRSPLIT returns a tuple, not a bag : so convert to bag and
>> > >>    -- flatten it.
>> > >>    people_reqd = FOREACH people_t GENERATE user, FLATTEN(TOBAG($1))
>> > >>    as (user_pet_id);
>> > >>
>> > >>
>> > >>    reqd_op = JOIN people_reqd BY user_pet_id, pets BY pet_id PARALLEL
>> > >>    $MY_PARALLEL;
>> > >>
>> > >>
>> > >>    reqd_op should contain what you need ...
>> > >>
>> > >>
>> > >>
>> > >>    Regards,
>> > >>    Mridul
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>    On Tuesday 10 May 2011 03:00 PM, Vincent wrote:
>> > >>
>> > >>        Hello dear Pig users,
>> > >>
>> > >>        I am loading a file with the following format:
>> > >>
>> > >>        $ cat peoples.txt
>> > >>        tom;1234,4567,6
>> > >>        anna;27894
>> > >>
>> > >>        The first field is a name; the second field is a concatenation
>> > >>        of an unknown number of pet ids.
>> > >>
>> > >>        I would like to JOIN this file with another one:
>> > >>
>> > >>        $ cat pets.txt
>> > >>        1234;dog;cocker
>> > >>        4567;mouse;usa
>> > >>        6;cat;persian
>> > >>        27894;cat;manx
>> > >>
>> > >>        The fields are the pet's id, the pet's type, and the pet's race.
>> > >>
>> > >>        I want to get the following result:
>> > >>
>> > >>        1234;dog;cocker;tom
>> > >>        4567;mouse;usa;tom
>> > >>        6;cat;persian;tom
>> > >>        27894;cat;manx;anna
>> > >>
>> > >>        The problem is that I don't know how to convert a tuple of fields
>> > >>        to lines, i.e. to put the file peoples.txt into the following
>> > >>        intermediate format:
>> > >>
>> > >>        tom,1234
>> > >>        tom,4567
>> > >>        tom,6
>> > >>        anna,27894
>> > >>
>> > >>        Thanks in advance for your help!
>> > >>
>> > >>
>> > >>             Vincent Hervieux
>> > >>
>> > >>
>> > >>
>> > >>
>> > >
>> >
>>
>
>
