Hi All,
I have some questions regarding vectorization.
Here is my Pig script snippet.
AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
STORE AU into '/scratch/AU';
AU has the following format:
(userid, (item_view_history))
(27,(0,1,1,0,0))
(28,(0,0,1,0,0))
(29,(0,0,1,0,1))
(30,(1,0,1,0,1))
I will have at least a few hundred thousand numbers in each (item_view_history);
for readability I am showing only 5 here.
I am not sure how to write this data in a format that Mahout's clustering
algorithms will be able to parse. I have the following steps, but I am not
sure my understanding is correct. Any help with this would be great!
VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
/* I am assuming the field userid will be used as the key and written using
   $INT_CONVERTER, and the tuple will be written using $VECTOR_CONVERTER.
   Is this correct? */
STORE VectorizedInput into '/scratch/VectorizedInput'
    using $SEQFILE_STORAGE('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
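
For context, the $-parameters used in the STORE statement are not defined in the snippet above. A minimal sketch of how they might be declared, assuming the Elephant Bird library (with its Mahout module) is on the classpath; the class names below are my assumption and are not taken from the script itself:

```pig
-- Assumed declarations (not part of the original snippet); the class names
-- are the ones I believe Elephant Bird ships, so please verify them against
-- the jar versions actually registered in the script.
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
```

With declarations like these, the first '-c' argument would name the converter for the key (userid) and the second the converter for the value (the item_view_history tuple).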