I'm going to try and replicate this with a simpler query, but it does look
like a bug to me.

In this case, I think you can avoid the issue by changing the last few
statements to the following:

inter_frac_sum = FOREACH inter_frac_combine GENERATE flatten($0) as
(seedword, coword), SUM(inter_frac.frac) as frac:double;

filtered = FILTER inter_frac_sum BY ($2 >= 0.5);

grouped = GROUP filtered by seedword;
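
From there, getting the seedword plus its cowords should just be a
projection out of the grouped bag. A rough, untested sketch, assuming the
flattened layout above (coword is the second field of filtered); the
aliases result and cowords are only names I picked for illustration:

```pig
-- Assumes `grouped` comes from the GROUP statement above, so each row is
-- (group, filtered: {(seedword, coword, frac)}).
-- filtered.$1 projects the second field (the coword) out of the bag.
result = FOREACH grouped GENERATE group AS seedword, filtered.$1 AS cowords;
DESCRIBE result;
```

I use the positional reference ($1) rather than a field name so it does not
depend on how the flattened fields end up being named.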


That said, your case is valid, and it should work. There is actually a
ticket that deals with this: nested access within bags in Pig (i.e.
$1.$0.$0.$0....)


Let me know if the above works. It changes the layout a little (the group
key is flattened into top-level fields), but the rest should be fairly
obvious.


2012/1/12 Yulia Tolskaya <[email protected]>

> That didn't work :(
> Here's my code
> keywords = LOAD 'merged' AS (seedword:chararray, doc:chararray);
>
> ---COUNT HOW MANY DOCUMENTS EACH WORD IS IN
> group_by_seedword = GROUP keywords BY $0;
>
> invert_index = FOREACH group_by_seedword GENERATE $0 as
> seedword:chararray, keywords.$1;
> word_doc_count= FOREACH invert_index GENERATE seedword, COUNT($1);
>
> -- map words to document
> words_in_doc= GROUP keywords BY doc;
> word_docs = FOREACH words_in_doc GENERATE group AS doc, keywords.seedword;
> --(document:(keyword, keyword, keyword...))
>
> --map words to their cowords in doc
> temp_join = JOIN keywords BY doc,word_docs BY doc;
> --DUMP temp_join;
> cowords_by_doc = FOREACH temp_join GENERATE $0 as seedword:chararray, $3
> as cowords;
>
> cowords_interm=  FOREACH cowords_by_doc GENERATE seedword,
> FLATTEN(cowords);
> -- GETS RID OF SINGLE DOC WORD
> cowords = FILTER cowords_interm BY (seedword != $1);
> temp_join_count1 = JOIN cowords BY $0, word_doc_count BY seedword;
>
> -- GETS WORDS THAT OCCUR BY THEMSELVES IN A SINGLE DOCUMENT
> G = JOIN cowords_interm BY $0 LEFT OUTER, cowords by $0;
> orph_word = FILTER G BY $2 is null;
> orph_word_count = FOREACH orph_word GENERATE $0,null, 0;
>
> temp_join_count= UNION temp_join_count1, orph_word_count;
>
> inter_frac = FOREACH temp_join_count GENERATE $0 as seedword:chararray, $1
> as coword:chararray, 1.0/$3 as frac:double;
> inter_frac_combine = GROUP inter_frac BY (seedword, coword);
> inter_frac_sum = FOREACH inter_frac_combine GENERATE $0 ,
> SUM(inter_frac.frac) as frac:double;
>
> filtered = FILTER inter_frac_sum BY ($1 >=$relatedness_ratio);
> grouped= GROUP filtered by $0.seedword;
> g = FOREACH grouped GENERATE group as seedword:chararray, filtered.$0;
> named = FOREACH g GENERATE $0 as seedword:chararray, $1 as
> baggy:bag{(outertup:tuple(groupy:tuple(seedword:chararray,
> coword:chararray)))};
>
>
> The input file is
> car     doc1.txt
> auto    doc1.txt
> bunny   doc2.txt
> ball    doc2.txt
> toy car         doc2.txt
> random  doc3.txt
>
> Plane    doc3.txt
>
>
> On 1/12/12 4:58 PM, "Jonathan Coveney" <[email protected]> wrote:
>
> >Try:
> >
> >a = foreach grouped generate seedword, baggy.outertup as tup;
> >b = foreach a generate seedword, flatten(tup.groupy) as (coword);
> >
> >do you think you could post a script that gets you to the grouped part of
> >it? It would make it much easier to help you.
> >
> >2012/1/12 Yulia Tolskaya <[email protected]>
> >
> >> That did not work. I get this error:
> >> Cannot find field coword in groupy:tuple(seedword:chararray,coword:chararray)
> >> If I try
> >> FOREACH grouped GENERATE seedword, baggy.groupy;
> >> I also get an error:
> >> Invalid field reference. Referenced field [groupy] does not exist in
> >> schema: seedword:chararray,coword:chararray (so it does seem to be
> >> ignoring all the nested tuples).
> >>
> >> This does seem like a bug!
> >> Can you think of a better solution? I could write a UDF to get rid of
> >> the nested tuples, but that really seems unnecessary.
> >>
> >> Thank you
> >>
> >> Yulia
> >>
> >>
> >> On 1/12/12 2:50 PM, "Jonathan Coveney" <[email protected]> wrote:
> >>
> >> >Hmm, this is an interesting case. I think there may be a bug here.
> >> >
> >> >grunt> grouped = load 'thing' as
> >> >(seedword:chararray,baggy:{outertup:(groupy:(seedword:chararray,
> >> >coword:chararray))});
> >> >grunt> describe grouped;
> >> >grouped: {seedword: chararray,baggy: {groupy: (seedword: chararray,coword: chararray)}}
> >> >
> >> >notice that outertup has been thrown out. I suppose having a bunch of
> >> >nested tuples is equivalent to getting rid of them, still, it's not
> >> >something that pig should do. For the Pig devs, is this expected?
> >> >
> >> >Either way, in this case, you would do:
> >> >
> >> >a = foreach grouped generate seedword, baggy.coword;
> >> >
> >> >and go from there
> >> >
> >> >let me know if that works
> >> >
> >> >2012/1/12 Yulia Tolskaya <[email protected]>
> >> >
> >> >> I have been stuck on this for several hours and I cannot figure out
> >> >> what I am doing wrong. I have a relation "grouped" with the schema of
> >> >>
> >> >>  grouped: {seedword: chararray,baggy: {outertup: (groupy: (seedword:
> >> >> chararray,coword: chararray))}}
> >> >>
> >> >> I need to generate just the seedword and a tuple of cowords. In my
> >> >> example I would want
> >> >>
> >> >> (auto, (car, truck)).
> >> >>
> >> >> I have tried:
> >> >>
> >> >>  FOREACH grouped GENERATE baggy::outertup.groupy.coword;
> >> >>
> >> >>  FOREACH grouped GENERATE baggy.outertup.groupy.coword;
> >> >>  FOREACH grouped GENERATE baggy.groupy.coword;
> >> >>
> >> >>
> >> >> and none of these (or other similar variations) work; they give me
> >> >> error messages saying there is no such field. I'm assuming there has
> >> >> to be a way to do this. Please help!!
> >> >>
> >> >>
> >> >> Yulia
> >> >>
> >> >>
> >>
> >>
>
>
