Trying to figure that out now. Will keep you posted.
> Date: Mon, 16 Dec 2013 12:13:52 -0800
> Subject: Re: Data Vectorization
> From: [email protected]
> To: [email protected]
>
> Looks reasonable. Does it work?
>
>
> On Mon, Dec 16, 2013 at 12:09 PM, Sameer Tilak <[email protected]> wrote:
>
> > Hi All,
> > I have some questions regarding vectorization.
> >
> > Here is my Pig script snippet.
> >
> > AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
> > STORE AU into '/scratch/AU';
> >
> > AU has the following format:
> > (userid, (item_view_history))
> > (27,(0,1,1,0,0))(28,(0,0,1,0,0))(29,(0,0,1,0,1))(30,(1,0,1,0,1))
> >
> > I will have at least a few hundred thousand numbers in the
> > (item_view_history); for readability I am just showing 5 here.
> >
> > I am not sure how to get this data written to a format that Mahout's
> > clustering algorithms will be able to parse. I have the following steps,
> > but I am not sure if my understanding is correct. Any help with this will
> > be great!
> >
> > VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
> >
> > /* I am assuming the field userid will be used as a key and will be written
> > using '$INT_CONVERTER', and the tuple will be written using
> > '$VECTOR_CONVERTER'. Is this correct? */
> > STORE VectorizedInput into '/scratch/VectorizedInput' using
> > $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
> >
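
A minimal sketch in plain Python of the key/value shape that FLATTEN($0) is expected to produce from the sample records quoted above: one integer key (userid) and one vector (item_view_history) per record. It does not call Pig, the sequence-file storage, or Mahout. The names parse_records and to_sparse are hypothetical helpers, and the sparse {index: value} form is only an assumption about what may be convenient when a history has a few hundred thousand entries.

```python
import re

# Sample records copied from the thread: (userid, (item_view_history)).
SAMPLE = "(27,(0,1,1,0,0))(28,(0,0,1,0,0))(29,(0,0,1,0,1))(30,(1,0,1,0,1))"

# One record: "(" userid "," "(" comma-separated numbers ")" ")".
RECORD = re.compile(r"\((\d+),\(([\d,]*)\)\)")


def parse_records(text):
    """Return (userid, dense_vector) pairs, one per record in the text."""
    pairs = []
    for match in RECORD.finditer(text):
        userid = int(match.group(1))
        values = [float(v) for v in match.group(2).split(",") if v]
        pairs.append((userid, values))
    return pairs


def to_sparse(values):
    """Keep only the nonzero entries as an {index: value} mapping."""
    return {i: v for i, v in enumerate(values) if v != 0.0}


pairs = parse_records(SAMPLE)
for userid, values in pairs:
    print(userid, to_sparse(values))
```

If every parsed pair has an integer key and vectors of equal length, the records at least have the two-field layout that the key converter and vector converter in the STORE statement are assumed to expect.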
