I'm going to try and replicate this with a simpler query, but it does look like a bug to me.
In this case, I think you can avoid the issue by changing these lines to the following:

inter_frac_sum = FOREACH inter_frac_combine GENERATE flatten($0) as (seedword, coword), SUM(inter_frac.frac) as frac:double;
filtered = FILTER inter_frac_sum BY ($2 >= 0.5);
grouped = GROUP filtered by seedword;

That said, your case is valid, and it should work. There is actually a ticket that deals with this: nested access within bags in Pig (i.e. $1.$0.$0.$0...).

Let me know if the above works. It changes the layout a little, since the group tuple is flattened into top-level fields, but the change should be fairly obvious.

2012/1/12 Yulia Tolskaya <[email protected]>

> That didn't work :(
> Here's my code
> keywords = LOAD 'merged' USING as ( seedword:chararray, doc:chararray);
>
> ---COUNT HOW MANY DOCUMENTS EACH WORD IS IN
> group_by_seedword = GROUP keywords BY $0;
>
> invert_index = FOREACH group_by_seedword GENERATE $0 as
> seedword:chararray, keywords.$1;
> word_doc_count= FOREACH invert_index GENERATE seedword, COUNT($1);
>
> -- map words to document
> words_in_doc= GROUP keywords BY doc;
> word_docs = FOREACH words_in_doc GENERATE group AS doc, keywords.seedword;
> --(document:(keyword, keyword, keyword...))
>
> --map words to their cowords in doc
> temp_join = JOIN keywords BY doc,word_docs BY doc;
> --DUMP temp_join;
> cowords_by_doc = FOREACH temp_join GENERATE $0 as seedword:chararray, $3
> as cowords;
>
> cowords_interm= FOREACH cowords_by_doc GENERATE seedword,
> FLATTEN(cowords);
> cowords = FILTER cowords_interm BY (seedword!=$1);---GETS RID OF SINGLE
> DOC WORD;
> temp_join_count1 = JOIN cowords BY $0, word_doc_count BY seedword;
>
> -- GETS WORDS THAT OCCUR BY THEMSELVES IN A SINGLE DOCUMENT
> G = JOIN cowords_interm BY $0 LEFT OUTER, cowords by $0;
> orph_word = FILTER G BY $2 is null;
> orph_word_count = FOREACH orph_word GENERATE $0,null, 0;
>
> temp_join_count= UNION temp_join_count1, orph_word_count;
>
> inter_frac = FOREACH temp_join_count GENERATE $0 as seedword:chararray, $1
> as coword:chararray, 1.0/$3 as frac:double;
> inter_frac_combine = GROUP inter_frac BY (seedword, coword);
> inter_frac_sum = FOREACH inter_frac_combine GENERATE $0 ,
> SUM(inter_frac.frac) as frac:double;
>
> filtered = FILTER inter_frac_sum BY ($1 >=$relatedness_ratio);
> grouped= GROUP filtered by $0.seedword;
> g = FOREACH grouped GENERATE group as seedword:chararray, filtered.$0;
> named = FOREACH g GENERATE $0 as seedword:chararray, $1 as
> baggy:bag{(outertup:tuple(groupy:tuple(seedword:chararray,
> coword:chararray)))}
>
>
> The input file is
> car doc1.txt
> auto doc1.txt
> bunny doc2.txt
> ball doc2.txt
> toy car doc2.txt
> random doc3.txt
>
> Plane doc3.txt
>
>
> On 1/12/12 4:58 PM, "Jonathan Coveney" <[email protected]> wrote:
>
> >Try:
> >
> >a = foreach grouped generate seedword, baggy.outertup as tup;
> >b = foreach grouped generate seedword, flatten(tup.groupy) as (coword);
> >
> >do you think you could post a script that gets you to the grouped part of
> >it? It would make it much easier to help you.
> >
> >2012/1/12 Yulia Tolskaya <[email protected]>
> >
> >> That did not work I get an error of :
> >> Cannot find field coword in groupy:tuple(seedword:chararray,coward:char
> >> array)
> >> IF I try
> >> FOREACH grouped GENERATE seedword, baggy.groupy;
> >> I also get an error:
> >> Invalid field reference. Referenced field [groupy] does not exist in
> >> schema: seedword:chararray,coward:char array. (so it does seem to be
> >> ignoring all the nested tuples).
> >>
> >> This does seem like a bug!
> >> Can you think of a better solution? I could write a UDF to get rid of the
> >> nested tuples, but that really seems unnecessary.
> >>
> >> Thank you
> >>
> >> Yulia
> >>
> >>
> >> On 1/12/12 2:50 PM, "Jonathan Coveney" <[email protected]> wrote:
> >>
> >> >Hmm, this is an interesting case. I think there may be a bug here.
> >> >
> >> >grunt> grouped = load 'thing' as
> >> >(seedword:chararray,baggy:{outertup:(groupy:(seedword:chararray,
> >> >coword:chararray))});
> >> >grunt> describe grouped;
> >> >grouped: {seedword: chararray,baggy: {groupy: (seedword: chararray,coword: chararray)}}
> >> >
> >> >notice that outertup has been thrown out. I suppose having a bunch of
> >> >nested tuples is equivalent to getting rid of them, still, it's not
> >> >something that pig should do. For the Pig devs, is this expected?
> >> >
> >> >Either way, in this case, you would do:
> >> >
> >> >a = foreach grouped generate seedword, baggy.coword;
> >> >
> >> >and go from there
> >> >
> >> >let me know if that works
> >> >
> >> >2012/1/12 Yulia Tolskaya <[email protected]>
> >> >
> >> >> I have been stuck on this for several hours and I cannot figure out what I
> >> >> am doing wrong. I have a relation "grouped" with the schema of
> >> >>
> >> >> grouped: {seedword: chararray,baggy: {outertup: (groupy: (seedword:
> >> >> chararray,coword: chararray))}}
> >> >>
> >> >> I need to generate just the seedword and a tuple of cowords. In my example
> >> >> I would want
> >> >>
> >> >> (auto, (car, truck)).
> >> >>
> >> >> I have tried:
> >> >>
> >> >> FOREACH grouped GENERATE baggy::outertup.groupy.coword;
> >> >>
> >> >> FOREACH grouped GENERATE baggy.outertup.groupy.coword;
> >> >> FOREACH grouped GENERATE baggy.groupy.coword;
> >> >>
> >> >>
> >> >> and none of these (or other similar variations) work, and give me error
> >> >> messages saying there is no such field. I'm assuming there has to be a way
> >> >> to do this. Please help!!
> >> >>
> >> >>
> >> >> Yulia
