That didn't work :(
Here's my code:
keywords = LOAD 'merged' USING as ( seedword:chararray, doc:chararray);
---COUNT HOW MANY DOCUMENTS EACH WORD IS IN
group_by_seedword = GROUP keywords BY $0;
invert_index = FOREACH group_by_seedword GENERATE $0 as
seedword:chararray, keywords.$1;
word_doc_count= FOREACH invert_index GENERATE seedword, COUNT($1);
-- map words to document
words_in_doc= GROUP keywords BY doc;
word_docs = FOREACH words_in_doc GENERATE group AS doc, keywords.seedword;
--(document:(keyword, keyword, keyword...))
--map words to their cowords in doc
temp_join = JOIN keywords BY doc,word_docs BY doc;
--DUMP temp_join;
cowords_by_doc = FOREACH temp_join GENERATE $0 as seedword:chararray, $3
as cowords;
cowords_interm= FOREACH cowords_by_doc GENERATE seedword,
FLATTEN(cowords);
cowords = FILTER cowords_interm BY (seedword!=$1); ---GETS RID OF SINGLE DOC WORD
temp_join_count1 = JOIN cowords BY $0, word_doc_count BY seedword;
-- GETS WORDS THAT OCCUR BY THEMSELVES IN A SINGLE DOCUMENT
G = JOIN cowords_interm BY $0 LEFT OUTER, cowords by $0;
orph_word = FILTER G BY $2 is null;
orph_word_count = FOREACH orph_word GENERATE $0,null, 0;
temp_join_count= UNION temp_join_count1, orph_word_count;
inter_frac = FOREACH temp_join_count GENERATE $0 as seedword:chararray, $1
as coword:chararray, 1.0/$3 as frac:double;
inter_frac_combine = GROUP inter_frac BY (seedword, coword);
inter_frac_sum = FOREACH inter_frac_combine GENERATE $0 ,
SUM(inter_frac.frac) as frac:double;
filtered = FILTER inter_frac_sum BY ($1 >=$relatedness_ratio);
grouped= GROUP filtered by $0.seedword;
g = FOREACH grouped GENERATE group as seedword:chararray, filtered.$0;
named = FOREACH g GENERATE $0 as seedword:chararray, $1 as
baggy:bag{(outertup:tuple(groupy:tuple(seedword:chararray,
coword:chararray)))};
The input file is
car doc1.txt
auto doc1.txt
bunny doc2.txt
ball doc2.txt
toy car doc2.txt
random doc3.txt
Plane doc3.txt
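
One possible way to sidestep the nested tuples, offered as an untested sketch rather than a confirmed fix: flatten the group key of "filtered" into plain seedword/coword columns before grouping, so the bag never ends up holding a tuple inside a tuple. The aliases pairs, by_seed and seed_cowords are made up for illustration, and the sketch assumes "filtered" still carries the (seedword, coword) group tuple in $0, as in the script above.

```pig
-- Sketch only; the alias names below are hypothetical.
-- $0 of filtered is the (seedword, coword) group key; FLATTEN turns it into two columns.
pairs = FOREACH filtered GENERATE FLATTEN($0) AS (seedword:chararray, coword:chararray);
-- Regroup on the plain seedword column.
by_seed = GROUP pairs BY seedword;
-- Project coword out of the grouped bag: one row per seedword with a bag of cowords.
seed_cowords = FOREACH by_seed GENERATE group AS seedword, pairs.coword AS cowords;
```

Note that this would give a bag of cowords per seedword rather than a tuple, which may or may not be acceptable for the downstream steps.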
On 1/12/12 4:58 PM, "Jonathan Coveney" <[email protected]> wrote:
>Try:
>
>a = foreach grouped generate seedword, baggy.outertup as tup;
>b = foreach a generate seedword, flatten(tup.groupy) as (coword);
>
>do you think you could post a script that gets you to the grouped part of
>it? It would make it much easier to help you.
>
>2012/1/12 Yulia Tolskaya <[email protected]>
>
>> That did not work. I get this error:
>> Cannot find field coword in groupy:tuple(seedword:chararray,coword:chararray)
>> If I try
>> FOREACH grouped GENERATE seedword, baggy.groupy;
>> I also get an error:
>> Invalid field reference. Referenced field [groupy] does not exist in
>> schema: seedword:chararray,coword:chararray. (So it does seem to be
>> ignoring all the nested tuples.)
>>
>> This does seem like a bug!
>> Can you think of a better solution? I could write a UDF to get rid of the
>> nested tuples, but that really seems unnecessary.
>>
>> Thank you
>>
>> Yulia
>>
>>
>> On 1/12/12 2:50 PM, "Jonathan Coveney" <[email protected]> wrote:
>>
>> >Hmm, this is an interesting case. I think there may be a bug here.
>> >
>> >grunt> grouped = load 'thing' as
>> >(seedword:chararray,baggy:{outertup:(groupy:(seedword:chararray,
>> >coword:chararray))});
>> >grunt> describe grouped;
>> >grouped: {seedword: chararray,baggy: {groupy: (seedword: chararray,coword: chararray)}}
>> >
>> >Notice that outertup has been thrown out. I suppose having a bunch of
>> >nested tuples is equivalent to getting rid of them; still, it's not
>> >something that Pig should do. For the Pig devs, is this expected?
>> >
>> >Either way, in this case, you would do:
>> >
>> >a = foreach grouped generate seedword, baggy.coword;
>> >
>> >and go from there.
>> >
>> >let me know if that works
>> >
>> >2012/1/12 Yulia Tolskaya <[email protected]>
>> >
>> >> I have been stuck on this for several hours and I cannot figure out what
>> >> I am doing wrong. I have a relation "grouped" with the schema of
>> >>
>> >> grouped: {seedword: chararray,baggy: {outertup: (groupy: (seedword:
>> >> chararray,coword: chararray))}}
>> >>
>> >> I need to generate just the seedword and a tuple of cowords. In my
>> >> example I would want
>> >>
>> >> (auto, (car, truck)).
>> >>
>> >> I have tried:
>> >>
>> >> FOREACH grouped GENERATE baggy::outertup.groupy.coword;
>> >>
>> >> FOREACH grouped GENERATE baggy.outertup.groupy.coword;
>> >> FOREACH grouped GENERATE baggy.groupy.coword;
>> >>
>> >>
>> >> and none of these (or other similar variations) work; they all give me
>> >> error messages saying there is no such field. I'm assuming there has to
>> >> be a way to do this. Please help!!
>> >>
>> >>
>> >> Yulia
>> >>
>> >>
>>
>>