Hi,

I found a fairly easy way to reproduce this error by running the script below on a cluster (distributed mode). The settings at the top are artificial; they are only there to produce the result with a few lines of input:

set pig.exec.reducers.bytes.per.reducer 32
set pig.exec.reducers.max 20

X = LOAD '$INPUT' USING PigStorage('$SEPARATOR');
Y = FOREACH X GENERATE COUNT_STAR(TOBAG($0 ..)) as count;
GROUPED = GROUP Y BY count;
MAX = FOREACH GROUPED GENERATE group as tokennum, COUNT(Y) as count;
MAXG = GROUP MAX ALL;
MAXX = foreach MAXG generate FLATTEN(TOP(1,1,MAX));
MAXX = foreach MAXX generate $0 as tokennum;
Z = CROSS MAXX, X;
STORE Z INTO '$OUT' USING PigStorage('$SEPARATOR');

As input I used the line

1 1

repeated 13 times. I think the only thing that matters is that Pig decides to use more than one reducer. In my case this input was enough for Pig to use 20 reducers. This yields:

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 2 records (12 bytes) in: "/tmp/mehmet/out"

But it should produce 13 lines, since it just appends MAXX to each input line.

Two odd facts:

1. If you replace

Z = CROSS MAXX, X

by

Z = CROSS MAXX, X parallel 20

the problem goes away. (Perhaps CROSS is not getting the number of reducers correctly when it is calculated.) The result is then:

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 13 records (78 bytes) in: "/tmp/mehmet/out"

2. If you skip all the steps that yield MAXX and just load MAXX from a file, the problem also goes away, which is strange: why should it matter where MAXX came from?

I am using Hadoop 2.0.0-cdh4.2.0, Pig version 0.10.0-cdh4.1.2

Mehmet

On 5/21/13 1:41 AM, "Jonathan Coveney" <jcove...@gmail.com> wrote:

>Any chance you could replicate this for us? Ideally some dummy data and a
>script?
>
>
>2013/5/19 Mehmet Tepedelenlioglu <mehmets...@yahoo.com>
>
>> Hi,
>>
>> Recently I was taking the cross product between 2 bags of tuples one of
>> which has only one tuple, to append the one with one element to all the
>> others (I know this is not the best way to do this, it was done as a
>> prototype).
>> There seems to be a bug with the cross product where not all the
>> tuples of the larger bag are replicated. All but one of the part files
>> are empty, and everything works just fine in the local mode (probably
>> because it uses only one reducer). Is anybody else aware of this issue?
>>
>> The version is:
>>
>> Apache Pig version 0.10.0-cdh4.1.2 (rexported)
>> compiled Nov 01 2012, 18:38:33
>>
>> Thanks,
>>
>> Mehmet