The Spark version, “spark-itemsimilarity”, uses _your_ IDs. It is ready to try and I’d love it if you could. The IDs are kept in a HashBiMap in memory on each cluster machine, so memory use is bounded by the size of the ID dictionary, but in practice that should work for many (most) applications. The conversion of your IDs into Mahout IDs is done inside the job, in parallel, so it’s about as fast as it can be, though we may be able to optimize the memory footprint over time.
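For what it’s worth, here is a rough Java sketch of the dictionary idea, assuming Guava’s HashBiMap; this is not the actual job code and the names are made up. Application IDs get dense Mahout ints, and the inverse view translates results back:

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

public class IdDictionary {
  // Maps application IDs to dense Mahout ints; the inverse view maps back.
  private final BiMap<String, Integer> dict = HashBiMap.create();

  // Look up or assign the Mahout int for an application ID.
  public synchronized int toMahoutId(String appId) {
    Integer id = dict.get(appId);
    if (id == null) {
      id = dict.size(); // next dense index
      dict.put(appId, id);
    }
    return id;
  }

  // Translate a Mahout int back to the original application ID.
  public synchronized String toAppId(int mahoutId) {
    return dict.inverse().get(mahoutId);
  }
}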
Run “mahout spark-itemsimilarity” to get a full list of options. You can provide input in some form of text-delimited format: the default delimiter set is [\t, ] and the expected schema is (userID, itemID, ignored-text), but you can tell the job which column of the TDF contains which ID, and even use filters to capture only the lines with the data you want if you are reading log files; there is a hypothetical example command below. I’ll see if I can get a doc up on the Mahout site to explain it a bit better.

As to providing input to Mahout in binary form: the Hadoop version of “rowsimilarity” takes a DRM sequence file. That is one row per user, containing a Mahout userID and a Mahout SparseVector of the item interactions (see the sketch below). You will still have to convert your IDs, though.
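To make the schema options concrete, an invocation might look something like the following. The option names are my best recollection of the help output, so treat them as assumptions and trust “mahout spark-itemsimilarity” itself; the paths and the “purchase” filter value are made up:

mahout spark-itemsimilarity \
  --input /logs/2014-08-15 \
  --output /similarity-out \
  --filter1 purchase \
  --filterColumn 1 \
  --rowIDColumn 0 \
  --itemIDColumn 2

That would read lines like “u1,purchase,iphone”, keep only the “purchase” rows, and take the user ID from column 0 and the item ID from column 2.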
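For the binary route, here is a minimal Java sketch of what such a file contains: a Hadoop SequenceFile of IntWritable row keys and VectorWritable rows, one per user, with the item interactions in a sparse vector. It assumes your IDs have already been translated to Mahout ints; the path and the values are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteDrm {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("drm-input/part-00000"); // made-up path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, VectorWritable.class);
    try {
      // One row per user: Mahout user ID 0 interacted with items 3 and 42.
      Vector row = new RandomAccessSparseVector(Integer.MAX_VALUE);
      row.setQuick(3, 1.0);
      row.setQuick(42, 1.0);
      writer.append(new IntWritable(0), new VectorWritable(row));
    } finally {
      writer.close();
    }
  }
}

Since it is a SequenceFile you can also enable the standard Hadoop block compression on it, which should cover the compression part of your question.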
On Aug 16, 2014, at 5:10 AM, Serega Sheypak <[email protected]> wrote:

> Hi,
> We are trying to calculate ItemSimilarity. Right now we have 2*10^7 input
> lines. I provide the input data as raw text each day to recalculate item
> similarities. We get 100..1000 new items each day.
> 1. It takes too much time to prepare the input data.
> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
> Is there any possibility to provide data to the Mahout MapReduce
> ItemSimilarity using some binary format with compression?