Hadoop supports automatic compression/decompression of text files, so that 
shouldn’t be a problem. At some point the data is almost always text; our goal 
is to take data in that form and make as much data prep as possible 
unnecessary. 
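
To make the text-delimited input concrete, here is a rough sketch in plain Python (not Mahout code) of splitting lines on the default delimiter described in the quoted message below, picking the user and item columns, and filtering log lines. The function name, parameter names, and sample log lines are hypothetical and only for illustration; the real job's options may behave differently.

```python
import re

# Default delimiter character class mentioned in the thread: tab, comma, or space.
DELIMITER = re.compile(r"[\t, ]")


def parse_interactions(lines, user_col=0, item_col=1, line_filter=None):
    """Yield (user_id, item_id) pairs from text-delimited lines.

    user_col, item_col and line_filter are hypothetical stand-ins for the
    column and filter options; any extra columns are ignored.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line_filter is not None and line_filter not in line:
            continue  # keep only lines that contain the filter text
        fields = DELIMITER.split(line)
        if max(user_col, item_col) >= len(fields):
            continue  # skip lines that do not have enough columns
        yield fields[user_col], fields[item_col]


# Made-up log lines: timestamp, user, action, item.
log_lines = [
    "2014-08-16\tu1\tpurchase\tiphone",
    "2014-08-16\tu2\tview\tnexus",
    "2014-08-16\tu3\tpurchase\tgalaxy",
]

pairs = list(parse_interactions(log_lines, user_col=1, item_col=3,
                                line_filter="purchase"))
```

Here `pairs` holds only the two purchase lines, with the IDs exactly as they appear in the text.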

I don’t know what is in Cloudera 4.7, but Mahout will work with it out of the 
box. You’ll have to ask Cloudera for other specifics. The Mahout 0.9 pre-built 
artifacts are all you need to use sequence file input with RSJ 
(RowSimilarityJob).

The spark-itemsimilarity job is only in the Mahout 1.0 snapshot, so you’ll need 
to download and build it. It requires Spark as well as Hadoop, so you’ll 
need to see if that is installed. If you want to write your own wrappers for 
the CooccurrenceAnalysis.Cooccurrence you can implement whatever format you 
want, but you’ll have to do your own ID translation. Cooccurrence at that level 
of the pipeline requires Mahout IDs.
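
If you do your own ID translation, the idea is roughly the following. This is a minimal sketch in plain Python, not Mahout's implementation. It assumes Mahout IDs are contiguous non-negative integers assigned in order of first appearance, and the class and method names are hypothetical. It plays the same role as the in-memory HashBiMap mentioned in the quoted message below: a two-way lookup between your IDs and integer indices.

```python
class IdDictionary:
    """Two-way mapping between external string IDs and integer indices.

    Assumption: integer IDs are contiguous, start at 0, and are assigned
    in the order external IDs are first seen.
    """

    def __init__(self):
        self._to_index = {}  # external ID -> integer index
        self._to_id = []     # integer index -> external ID

    def index_of(self, external_id):
        """Return the integer index for external_id, assigning one if new."""
        index = self._to_index.get(external_id)
        if index is None:
            index = len(self._to_id)
            self._to_index[external_id] = index
            self._to_id.append(external_id)
        return index

    def id_of(self, index):
        """Return the external ID that was assigned the given index."""
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)


# Separate dictionaries for users and items, as both need translating.
users = IdDictionary()
items = IdDictionary()

interactions = [("u1", "iphone"), ("u2", "nexus"), ("u1", "nexus")]

# Translate your IDs to integer indices before the similarity computation...
encoded = [(users.index_of(u), items.index_of(i)) for u, i in interactions]

# ...and translate the integer indices back to your IDs when exporting results.
decoded = [(users.id_of(u), items.id_of(i)) for u, i in encoded]
```

Both dictionaries have to fit in memory, which is the same limit noted for the built-in translation.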

Remember that most work is done in Spark with RDDs, which do away with the need 
for intermediate files. You will generally only use files to store input and 
results. Think of them as import/export files, not working data. 

On Aug 16, 2014, at 10:32 AM, Serega Sheypak <[email protected]> wrote:

Hi, I'm sitting on Cloudera 4.7; does it work out of the box?
Right now I expect a simple interface from Mahout: user_id, item_id, pref.
I also expect support for seq file / Avro. Really, it's impossible to work
with TDF. Too much data... ^(





2014-08-16 20:16 GMT+04:00 Pat Ferrel <[email protected]>:

> The Spark version “spark-itemsimilarity” uses _your_ IDs. It is ready to
> try and I’d love it if you could. The IDs are kept in a HashBiMap in memory
> on each cluster machine and so it's memory limited to the size of the
> dictionary but in practice that will probably work for many (most)
> applications. This conversion of your ID into Mahout ID is done in the job
> and in parallel so it's about as fast as can be though we may be able to
> optimize the memory footprint in time.
> 
> Run “mahout spark-itemsimilarity” to get a full list of options. You can
> specify some form of text-delimited format for input. The default uses
> [\t, ] for the delimiter and expects (userID,itemID,ignored-text), but you
> can specify which column in the TDF contains which ID and even use filters
> to capture only the lines with data if you are using log files.
> 
> I’ll see if I can get a doc up on the mahout site to explain it a bit
> better.
> 
> As to providing input to Mahout in binary form, the Hadoop version of
> “rowsimilarity” takes a DRM sequence file. This would be a row per user
> containing a Mahout userID and Mahout SparseVector of the item
> interactions. You will still have to convert IDs though.
> 
> On Aug 16, 2014, at 5:10 AM, Serega Sheypak <[email protected]>
> wrote:
> 
> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I provide input data as raw text
> each day to recalculate item similarities. We get 100..1000 new items
> each day.
> 1. It takes too much time to prepare input data.
> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
> 
> Is there any possibility to provide data to the Mahout MapReduce
> ItemSimilarity using some binary format with compression?
> 
> 
