Hi everyone,

My Pig script generates the following -- results are stored in part-m-00000 to 
part-m-00004 files.

-bash-4.1$ hadoop dfs -ls /scratch/ItemIds

Found 7 items
-rw-r--r--   1 userid supergroup          0 2013-12-23 11:13 
/scratch/ItemIds/_SUCCESS
drwxr-xr-x   - userid supergroup          0 2013-12-23 11:12 
/scratch/ItemIds/_logs
-rw-r--r--   1 userid supergroup     276019 2013-12-23 11:12 
/scratch/ItemIds/part-m-00000
-rw-r--r--   1 userid supergroup     272188 2013-12-23 11:12 
/scratch/ItemIds/part-m-00001
-rw-r--r--   1 userid supergroup     252597 2013-12-23 11:12 
/scratch/ItemIds/part-m-00002
-rw-r--r--   1 userid supergroup     236508 2013-12-23 11:12 
/scratch/ItemIds/part-m-00003
-rw-r--r--   1 userid supergroup     270658 2013-12-23 11:12 
/scratch/ItemIds/part-m-00004

The output is stored as tab-separated values:

userid1 itemid1 itemid2 itemid3 ......
userid2 itemid1 itemid2 itemid3 ......
......

I have the following questions:

1. Is there a Mahout utility that I can point at /scratch/ItemIds to merge 
these 5 part files into a single file?

2. What is the recommended way to parse this tab-separated file in MapReduce? 
I want to vectorize this data, and I would like to do it in parallel. I 
already know how to vectorize the data correctly and how to run k-means on 
it.
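For reference, this is the kind of per-line parsing I have in mind for 
question 2 (a minimal plain-Java sketch; the class and method names are just 
illustrative, and inside a real Hadoop job this logic would live in a 
Mapper's map() method, with the line coming from the Text value):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Mahout API: parses one record of the form
// "userid<TAB>itemid1<TAB>itemid2<TAB>..." produced by the Pig script.
public class TsvLineParser {

    // Returns just the item ids, dropping the leading user id field.
    public static List<String> parseItemIds(String line) {
        String[] fields = line.split("\t");
        // fields[0] is the user id; the remaining fields are item ids
        return Arrays.asList(fields).subList(1, fields.length);
    }

    public static void main(String[] args) {
        String line = "userid1\titemid1\titemid2\titemid3";
        System.out.println(parseItemIds(line)); // [itemid1, itemid2, itemid3]
    }
}
```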

I have been using the following command to run my clustering algorithm on 
dummy data. Now I want to ingest real data.

hadoop jar /apps/analytics/myanalytics.jar myanalytics.SimpleKMeansClustering 
-libjars /apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar 
/:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar

However, I am not sure: if I write the code to vectorize the data in my 
SimpleKMeansClustering class, will the above command run it in MapReduce mode?
                                          
