I don't think this implementation is going to be practical at any significant scale; it's more of a toy implementation that reads everything into memory. You're welcome to propose a speedup patch as long as it doesn't break the semantics. I would not use Mongo this way, nor would I probably use the Netflix data set as-is in a non-distributed setup.
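
If you do want to attack the ID-translation overhead specifically, one option is to derive the long ID deterministically from the string ID with a hash, so the per-document round trip to the conversion collection goes away entirely; I believe Mahout's in-memory ID migrators use essentially this trick (first 8 bytes of an MD5 digest). Here is a rough, untested sketch of the idea; the class and method names are mine, not Mahout's:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch, not Mahout code: map a string ID to a long
 * with a deterministic hash instead of querying a Mongo "conversion"
 * collection per document. The reverse map is kept in memory so
 * recommendations can be translated back to the original string IDs.
 */
public class StringLongIDMapper {

  private final Map<Long, String> longToString = new HashMap<Long, String>();

  /** Packs the first 8 bytes of the MD5 digest into a long. */
  public long toLongID(String stringID) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(stringID.getBytes(StandardCharsets.UTF_8));
      long id = 0L;
      for (int i = 0; i < 8; i++) {
        id = (id << 8) | (digest[i] & 0xFF);
      }
      longToString.put(id, stringID); // remember the reverse mapping
      return id;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // MD5 is always available
    }
  }

  public String toStringID(long longID) {
    return longToString.get(longID);
  }
}

This assumes 64-bit hash collisions are acceptable at your data size, which they usually are for Netflix-scale ID counts. It also cuts the Mongo traffic to a single scan of the ratings collection, and if you never need to map back, you can drop the reverse map entirely.
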
On Thu, Nov 15, 2012 at 3:23 PM, Onur Kuru <[email protected]> wrote:
> Hello!
>
> I have exported the Netflix data to a MongoDB and then tried to build a
> MongoDBDataModel, but it is taking too long. As I inspected the
> MongoDBDataModel class, I found out that it makes a conversion from
> string to long, because Mongo uses strings for user_id and item_id while
> Mahout uses longs for IDs.
>
> MongoDBDataModel stores these conversions in another collection, and as
> it iterates over all the documents in the ratings collection, it checks
> this conversion collection to see whether it has assigned a long ID to
> every string ID (user & item). I think checking/creating a new one (if
> necessary) in this collection becomes a significant overhead when the
> data set is large.
>
> Is there any solution to this included in Mahout, or do I have to write
> my own optimized code?
>
> Regards,
> Onur
