Hi, I'm on Cloudera 4.7. Does it work out of the box?
Right now I expect a simple interface from Mahout: user_id, item_id, pref.
I also expect support for sequence files / Avro. It's really impossible to
work with TDF. Too much data... ^(

2014-08-16 20:16 GMT+04:00 Pat Ferrel <[email protected]>:

> The Spark version “spark-itemsimilarity” uses _your_ IDs. It is ready to
> try and I’d love it if you could. The IDs are kept in a HashBiMap in memory
> on each cluster machine, so the dictionary has to fit in memory on every
> machine, but in practice that will probably work for many (most)
> applications. The conversion of your IDs into Mahout IDs is done in the job
> and in parallel, so it's about as fast as can be, though we may be able to
> optimize the memory footprint in time.
>
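To make the dictionary idea concrete, here is a minimal Python sketch of a two-way ID map. It is not Mahout's code: the class name `IdDictionary` and its methods are hypothetical, and it only illustrates why memory grows with the number of distinct IDs.

```python
class IdDictionary:
    """Two-way map between external string IDs and dense integer indices.

    Hypothetical illustration only; it mimics the role of an in-memory
    bidirectional map, not Mahout's actual implementation.
    """

    def __init__(self):
        self._to_index = {}  # external ID -> integer index
        self._to_id = []     # integer index -> external ID

    def index_of(self, external_id):
        """Return the index for an ID, assigning the next free one if new."""
        idx = self._to_index.get(external_id)
        if idx is None:
            idx = len(self._to_id)
            self._to_index[external_id] = idx
            self._to_id.append(external_id)
        return idx

    def id_of(self, index):
        """Translate an integer index back to the original external ID."""
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)


users = IdDictionary()
encoded = [users.index_of(u) for u in ["u-42", "u-7", "u-42"]]
decoded = [users.id_of(i) for i in encoded]
```

Both directions are stored, so the footprint is roughly twice the size of the ID set, which is the memory limit mentioned above.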
> Run “mahout spark-itemsimilarity” to get a full list of options. You can
> specify some form of text-delimited format (TDF) for input. The default
> uses [\t, ] for the delimiter and expects (userID,itemID,ignored-text), but
> you can specify which column in the TDF contains which ID and even use
> filters to capture only the lines with data if you are using log files.
>
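As a rough illustration of that input handling, the Python sketch below splits lines on the default delimiter class, picks the ID columns by position, and keeps only lines containing a filter string. The function `parse_interactions` and its parameters `user_col`, `item_col`, and `keep` are assumptions made for this example; they are not the job's actual option names, which the command's own help output lists.

```python
import re

# The default delimiter class quoted above: tab, comma, or space.
DELIMITER = re.compile(r"[\t, ]")


def parse_interactions(lines, user_col=0, item_col=1, keep=None):
    """Yield (user_id, item_id) pairs from text-delimited lines.

    user_col / item_col select the columns holding the IDs; keep, if given,
    is a substring a line must contain to be used (a simple log filter).
    All three names are hypothetical.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if keep is not None and keep not in line:
            continue  # skip lines the filter does not match
        fields = DELIMITER.split(line)
        if len(fields) <= max(user_col, item_col):
            continue  # skip lines that are too short to hold both IDs
        yield fields[user_col], fields[item_col]


log = [
    "2014-08-16\tu1\tpurchase\tiphone",
    "2014-08-16\tu2\tview\tipad",
    "",
    "2014-08-16\tu2\tpurchase\tnexus",
]
pairs = list(parse_interactions(log, user_col=1, item_col=3, keep="purchase"))
default_pairs = list(parse_interactions(["u1,i1,ignored"]))
```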
> I’ll see if I can get a doc up on the mahout site to explain it a bit
> better.
>
> As to providing input to Mahout in binary form, the Hadoop version of
> “rowsimilarity” takes a DRM sequence file. This would be a row per user
> containing a Mahout userID and Mahout SparseVector of the item
> interactions. You will still have to convert IDs though.
>
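The logical layout of such a matrix, one sparse row per user, can be sketched in Python as follows. This shows only the grouping step on already-converted integer IDs; it does not write a Hadoop sequence file, and the helper name `to_sparse_rows` is hypothetical.

```python
from collections import defaultdict


def to_sparse_rows(pairs):
    """Group (user_index, item_index) pairs into one sparse row per user.

    Each row is a dict of item_index -> 1.0, storing only the items the
    user interacted with. Repeated pairs collapse into a single entry.
    """
    rows = defaultdict(dict)
    for user_idx, item_idx in pairs:
        rows[user_idx][item_idx] = 1.0  # 1.0 marks that the interaction occurred
    return dict(rows)


rows = to_sparse_rows([(0, 3), (0, 5), (1, 3), (0, 3)])
```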
> On Aug 16, 2014, at 5:10 AM, Serega Sheypak <[email protected]>
> wrote:
>
> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I provide the input data as raw text
> each day to recalculate item similarities. We get 100 to 1000 new items
> each day.
> 1. It takes too much time to prepare the input data.
> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>
> Is there any possibility to provide data to the Mahout MapReduce
> ItemSimilarity job using some binary format with compression?
>
>
