The Spark version, “spark-itemsimilarity”, uses _your_ IDs. It is ready to try and I’d love it if you could. The IDs are kept in a HashBiMap in memory on each cluster machine, so memory use is bounded by the size of the ID dictionary, but in practice that should work for many (most) applications. The conversion of your IDs into Mahout IDs is done inside the job, in parallel, so it’s about as fast as it can be, though we may be able to optimize the memory footprint over time.
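For what it’s worth, here is a rough Java sketch of the dictionary idea, assuming Guava’s HashBiMap; this is not the actual job code and the names are made up. Application IDs get dense Mahout ints, and the inverse view translates results back:

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

public class IdDictionary {
  // Maps application IDs to dense Mahout ints; the inverse view maps back.
  private final BiMap<String, Integer> dict = HashBiMap.create();

  // Look up or assign the Mahout int for an application ID.
  public synchronized int toMahoutId(String appId) {
    Integer id = dict.get(appId);
    if (id == null) {
      id = dict.size(); // next dense index
      dict.put(appId, id);
    }
    return id;
  }

  // Translate a Mahout int back to the original application ID.
  public synchronized String toAppId(int mahoutId) {
    return dict.inverse().get(mahoutId);
  }
}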
Run “mahout spark-itemsimilarity” to get a full list of options. You can provide input in some form of text-delimited format: the default delimiter set is [\t, ] and the expected schema is (userID, itemID, ignored-text), but you can tell the job which column of the TDF contains which ID, and even use filters to capture only the lines with the data you want if you are reading log files; there is a hypothetical example command below. I’ll see if I can get a doc up on the Mahout site to explain it a bit better.

As to providing input to Mahout in binary form: the Hadoop version of “rowsimilarity” takes a DRM sequence file. That is one row per user, containing a Mahout userID and a Mahout SparseVector of the item interactions (see the sketch below). You will still have to convert your IDs, though.
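To make the schema options concrete, an invocation might look something like the following. The option names are my best recollection of the help output, so treat them as assumptions and trust “mahout spark-itemsimilarity” itself; the paths and the “purchase” filter value are made up:

mahout spark-itemsimilarity \
  --input /logs/2014-08-15 \
  --output /similarity-out \
  --filter1 purchase \
  --filterColumn 1 \
  --rowIDColumn 0 \
  --itemIDColumn 2

That would read lines like “u1,purchase,iphone”, keep only the “purchase” rows, and take the user ID from column 0 and the item ID from column 2.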
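For the binary route, here is a minimal Java sketch of what such a file contains: a Hadoop SequenceFile of IntWritable row keys and VectorWritable rows, one per user, with the item interactions in a sparse vector. It assumes your IDs have already been translated to Mahout ints; the path and the values are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteDrm {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("drm-input/part-00000"); // made-up path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, VectorWritable.class);
    try {
      // One row per user: Mahout user ID 0 interacted with items 3 and 42.
      Vector row = new RandomAccessSparseVector(Integer.MAX_VALUE);
      row.setQuick(3, 1.0);
      row.setQuick(42, 1.0);
      writer.append(new IntWritable(0), new VectorWritable(row));
    } finally {
      writer.close();
    }
  }
}

Since it is a SequenceFile you can also enable the standard Hadoop block compression on it, which should cover the compression part of your question.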
On Aug 16, 2014, at 5:10 AM, Serega Sheypak <[email protected]> wrote:

> Hi,
> We are trying to calculate ItemSimilarity. Right now we have 2*10^7 input
> lines. I provide the input data as raw text each day to recalculate item
> similarities. We get 100..1000 new items each day.
> 1. It takes too much time to prepare the input data.
> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
> Is there any possibility to provide data to the Mahout MapReduce
> ItemSimilarity using some binary format with compression?