Trying to figure that out now. Will keep you posted.

> Date: Mon, 16 Dec 2013 12:13:52 -0800
> Subject: Re: Data Vectorization
> From: [email protected]
> To: [email protected]
> 
> Looks reasonable.  Does it work?
> 
> 
> On Mon, Dec 16, 2013 at 12:09 PM, Sameer Tilak <[email protected]> wrote:
> 
> > Hi All,
> > I have some questions regarding vectorization.
> >
> > Here is my Pig script snippet.
> >
> > AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
> > STORE AU into '/scratch/AU';
> > AU has the following format:
> > (userid, (item_view_history))
> > (27,(0,1,1,0,0))
> > (28,(0,0,1,0,0))
> > (29,(0,0,1,0,1))
> > (30,(1,0,1,0,1))
> > I will have at least a few hundred thousand numbers in
> > (item_view_history); for readability I am showing only 5 here.
> > I am not sure how to write this data in a format that Mahout's
> > clustering algorithms can parse. I have the following steps, but I am
> > not sure my understanding is correct. Any help with this would be
> > great!
> >
> > VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
> >
> > /* I am assuming the field userid will be used as the key and will be
> > written using $INT_CONVERTER, and the tuple will be written using
> > $VECTOR_CONVERTER. Is this correct? */
> > STORE VectorizedInput into '/scratch/VectorizedInput' using
> > $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
> >
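For anyone who wants to sanity-check the AU output before the STORE step, here is a minimal Python sketch. It is not part of the original script. It parses records in the (userid, (item_view_history)) format shown above and checks that every history has the same length, which is what I would expect a fixed-cardinality vector to need. The helper names parse_records and to_sparse are hypothetical, and the sample is the 5-element example from the message, not real data.

```python
import re

# Matches one record such as "(27,(0,1,1,0,0))":
# group 1 is the user id, group 2 is the comma-separated inner tuple.
RECORD = re.compile(r"\((\d+),\(([^()]*)\)\)")


def parse_records(text):
    """Return a list of (userid, [values]) pairs found in `text`."""
    records = []
    for match in RECORD.finditer(text):
        userid = int(match.group(1))
        # Skip empty fields so an empty inner tuple yields an empty list.
        values = [int(v) for v in match.group(2).split(",") if v.strip()]
        records.append((userid, values))
    return records


def to_sparse(values):
    """Keep only the nonzero positions, as {index: value}."""
    return {i: v for i, v in enumerate(values) if v != 0}


# Sample copied from the message (5 entries per user for readability).
sample = "(27,(0,1,1,0,0))(28,(0,0,1,0,0))(29,(0,0,1,0,1))(30,(1,0,1,0,1))"

records = parse_records(sample)

# All histories should share one length if they are to become
# vectors of the same cardinality.
cardinalities = {len(values) for _, values in records}

for userid, values in records:
    print(userid, len(values), to_sparse(values))
```

With a few hundred thousand mostly-zero entries per user, looking at the nonzero positions only (as to_sparse does) is also a quick way to judge whether a sparse vector representation would be worth asking for.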
                                          
