Re: Data Vectorization

Andrew Musselman Mon, 16 Dec 2013 12:14:56 -0800

Looks reasonable.  Does it work?


On Mon, Dec 16, 2013 at 12:09 PM, Sameer Tilak wrote:

> Hi All,
> I have some questions regarding vectorization.
>
> Here is my Pig script snippet.
>
> AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
> STORE AU into '/scratch/AU';
>
> AU has the following format:
> (userid, (item_view_history))
> (27,(0,1,1,0,0))
> (28,(0,0,1,0,0))
> (29,(0,0,1,0,1))
> (30,(1,0,1,0,1))
> I will have at least a few hundred thousand numbers in
> (item_view_history); for readability I am showing only 5 here.
> I am not sure how to write this data in a format that Mahout's
> clustering algorithms can parse. I have the following steps, but I am
> not sure my understanding is correct. Any help with this would be
> great!
>
> VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
>
> /* I am assuming the field userid will be used as the key and written
> using '$INT_CONVERTER', and the tuple will be written using
> '$VECTOR_CONVERTER'. Is this correct? */
> STORE VectorizedInput into '/scratch/VectorizedInput' using
> $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
>
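
One way to sanity-check the data before storing it is to parse the AU text output and confirm that every history tuple has the same length. The following is only a minimal sketch in Python, not part of the original script. It assumes the records look exactly like the sample above, and the names parse_records and to_sparse are hypothetical helpers introduced for illustration.

```python
import re

# Matches one record such as (27,(0,1,1,0,0)).
# Assumption: the format is exactly the one shown in the sample output.
RECORD = re.compile(r"\((\d+),\(([\d,]*)\)\)")


def parse_records(text):
    """Return a list of (userid, history) pairs parsed from AU-style text."""
    records = []
    for match in RECORD.finditer(text):
        userid = int(match.group(1))
        body = match.group(2)
        history = [int(v) for v in body.split(",")] if body else []
        records.append((userid, history))
    return records


def to_sparse(history):
    """Return the indices of the non-zero entries (hypothetical helper)."""
    return [i for i, v in enumerate(history) if v != 0]


sample = "(27,(0,1,1,0,0))(28,(0,0,1,0,0))(29,(0,0,1,0,1))(30,(1,0,1,0,1))"
records = parse_records(sample)

# Every vector should have the same number of entries.
lengths = {len(history) for _, history in records}
same_length = len(lengths) == 1

for userid, history in records:
    print(userid, to_sparse(history))
```

With a few hundred thousand entries per user, listing only the non-zero indices as to_sparse does may be easier to inspect than the full tuple. Whether a sparse or dense vector suits the clustering job is a separate choice that this sketch does not settle.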
