Hi, I'm running Cloudera 4.7. Does it work out of the box? Right now I expect a simple interface from Mahout: user_id, item_id, pref. I also expect support for sequence files / Avro. It's really impossible to work with TDF: there is too much data... ^(
2014-08-16 20:16 GMT+04:00 Pat Ferrel <[email protected]>:
> The Spark version “spark-itemsimilarity” uses _your_ IDs. It is ready to
> try and I’d love it if you could. The IDs are kept in a HashBiMap in memory
> on each cluster machine, so it is memory-limited to the size of the
> dictionary, but in practice that will probably work for many (most)
> applications. This conversion of your ID into a Mahout ID is done in the job
> and in parallel, so it's about as fast as can be, though we may be able to
> optimize the memory footprint in time.
>
> Run “mahout spark-itemsimilarity” to get a full list of options. You can
> specify some form of text-delimited format for input: the default uses [\t, ]
> for the delimiter and expects (userID,itemID,ignored-text), but you can
> specify which column in the TDF contains which ID and even use filters to
> capture only the lines with data if you are using log files.
>
> I’ll see if I can get a doc up on the Mahout site to explain it a bit
> better.
>
> As to providing input to Mahout in binary form, the Hadoop version of
> “rowsimilarity” takes a DRM sequence file. This would be a row per user
> containing a Mahout userID and Mahout SparseVector of the item
> interactions. You will still have to convert IDs though.
>
> On Aug 16, 2014, at 5:10 AM, Serega Sheypak <[email protected]>
> wrote:
>
> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I provide input data as raw text
> each day to recalculate item similarities. We get +100..1000 new items
> each day.
> 1. It takes too much time to prepare input data.
> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>
> Is there any possibility to provide data to Mahout mapreduce
> ItemSimilarity using some binary format with compression?
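
For reference, here is a rough sketch in plain Python of the ID translation described in the quoted message: split each line on the default [\t, ] delimiter, map external user and item IDs to dense integer indices through a two-way dictionary, and collect one sparse row of item indices per user. This is only an illustration and not Mahout code. The names `BiDict` and `build_rows` are made up for this example, and the two-way dictionary is a simple stand-in for the in-memory HashBiMap mentioned above.

```python
import re

# Default delimiter class mentioned in the thread: tab, comma, or space.
DELIMITER = re.compile(r"[\t, ]")


class BiDict:
    """Hypothetical two-way dictionary: external string ID <-> dense int index."""

    def __init__(self):
        self.to_index = {}  # external ID -> integer index
        self.to_id = []     # integer index -> external ID

    def index_of(self, external_id):
        # Assign the next free integer the first time an ID is seen.
        if external_id not in self.to_index:
            self.to_index[external_id] = len(self.to_id)
            self.to_id.append(external_id)
        return self.to_index[external_id]


def build_rows(lines, user_col=0, item_col=1):
    """Return (users, items, rows); rows maps user index -> set of item indices."""
    users, items = BiDict(), BiDict()
    rows = {}
    for line in lines:
        fields = DELIMITER.split(line.strip())
        if len(fields) <= max(user_col, item_col):
            continue  # skip lines that do not carry both IDs
        u = users.index_of(fields[user_col])
        i = items.index_of(fields[item_col])
        rows.setdefault(u, set()).add(i)
    return users, items, rows


# Made-up sample input in (userID, itemID, ignored-text) form.
sample = ["u1,iA,view", "u1\tiB\tview", "u2 iA view", "garbage"]
users, items, rows = build_rows(sample)
```

Both dictionaries live entirely in memory, which is why the size of the ID dictionary is the limiting factor in this scheme; `users.to_id` and `items.to_id` translate the integer indices back to the original IDs once the similarities are computed.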
