Thanks, Tariq, for the explanations. Once there's one name associated with the union, we can consider it as one input, I assume.

Keren

On Thu, Jul 25, 2013 at 12:33 AM, Mohammad Tariq <[email protected]> wrote:

> You could try something like this:
>
> A = load '/1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
> B = load '/1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
> C = union A, B;
> D = group C by 1;
> E = foreach D generate flatten(C);
> store E into '/dir';
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Thu, Jul 25, 2013 at 12:52 PM, Mohammad Tariq <[email protected]> wrote:
>
> > Hello Keren,
> >
> > There is nothing wrong in this. One dataset in Hadoop is usually one
> > folder and not one file. Pig is doing what it is supposed to do and
> > performing a union on both the files. You would have seen the content of
> > both the files together while doing dump C.
> >
> > Since this is a map-only job, and 2 mappers are getting generated, you are
> > getting 2 separate files, which together are actually one complete dataset.
> > If you want to have just one file, you need to force a reduce so that you
> > get all the results collectively in a single output file.
> >
> > HTH
> >
> > Warm Regards,
> > Tariq
> > cloudfront.blogspot.com
> >
> >
> > On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> According to Pig's documentation on union, two relations which have the
> >> same schema (the same length, with types that can be implicitly cast) can
> >> be concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union)
> >>
> >> However, when I try with:
> >> A = load '1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
> >> B = load '1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
> >> C = union A, B;
> >> describe C;
> >> DUMP C;
> >> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
> >>
> >> with:
> >> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
> >> ::::::::::::::
> >> 1.txt
> >> ::::::::::::::
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >> ::::::::::::::
> >> 1_ext.txt
> >> ::::::::::::::
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >>
> >>
> >> I get in result:
> >> ~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
> >> ::::::::::::::
> >> res/part-m-00000
> >> ::::::::::::::
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >> ::::::::::::::
> >> res/part-m-00001
> >> ::::::::::::::
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >>
> >> Whereas I was expecting something like
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >>
> >> [all together]
> >>
> >> I understand that two files for non-matching schemas would be generated,
> >> but why for a union with a matching schema?
> >>
> >> Thanks,
> >> Keren
> >>
> >> --
> >> Keren Ouaknine
> >> Web: www.kereno.com
> >>

--
Keren Ouaknine
Web: www.kereno.com
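
A minimal sketch of the "one name, one input" point from this thread, assuming the union has already been stored under '/dir' (the path from Tariq's example) with the default tab-delimited PigStorage. Loading the directory by name reads every part file inside it as a single relation. The second half is an alternative way to force a reduce, not the method proposed above: sorting with PARALLEL 1 should leave a single part file. The output path '/dir_single' is hypothetical.

```pig
-- Assumption: '/dir' already holds the stored union (e.g. part-m-00000, part-m-00001),
-- written with the default PigStorage, so fields are tab-separated.
-- Loading the directory picks up all part files in it as one relation.
R = load '/dir' as (x:int, y:chararray, z:chararray);
dump R;

-- Alternative sketch for a single output file: ORDER forces a reduce phase,
-- and PARALLEL 1 asks for one reducer, hence one part-r-* file.
S = order R by x parallel 1;
store S into '/dir_single';  -- hypothetical output path
```

Unlike the plain union, the sorted output also has a defined row order (by x), which matches the "all together" listing expected in the original question.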
