Hi All,

We are using Apache Pig to build our data pipeline. Our data has the
following format:

userid, items {code 1, code 2, ….}, few other features...

Each item has a unique alphanumeric code. I would like to use Mahout to
cluster this data. To vectorize it, we represent the item codes as a 1 x M
matrix in which each column represents an item (1 if a given user has viewed
that item, 0 otherwise). The matrix will have millions of columns. So each
user will have an id and this matrix. I am generating the matrix in a Pig UDF.

AU = FOREACH A GENERATE FLATTEN(myparser.myUDF(key, values)); 

/*Data I get back from my UDF should have the following format: 
{(userid,1,0,0,1,0,.........)} */ 

STORE AU into 'vector.out' using $SEQFILE_STORAGE ('-c $INT_CONVERTER',
    '-c $VECTOR_CONVERTER');

/* Use mahout for analyzing the data */

I am returning a bag from my UDF because the data can potentially have
hundreds of millions of items, and from my understanding everything in a tuple
needs to fit into memory. Is there a better way of doing this? I want to make
sure that I am on the right track.

                                          
