That's better, but still pretty slow.

On Tue, May 31, 2011 at 11:16 PM, Mike Khristo <[email protected]> wrote:
> I haven't modified the patch, but yes, it appears to be storing the
> translations into a collection it creates (mongo_data_model_map):
> https://issues.apache.org/jira/secure/attachment/12479895/MAHOUT-705.patch
>
> The patch doesn't put any indexes on the mongoMapCollection.
>
> Just added the following:
> db.mongo_data_model_map.ensureIndex({element_id : 1})
> db.mongo_data_model_map.ensureIndex({long_value : 1})
>
> Now it's doing about 50k translations per minute (as opposed to 30k per
> hour).
>
> On Tue, May 31, 2011 at 11:01 PM, Ted Dunning <[email protected]> wrote:
>
> > Are you putting the translations into Mongo?
> >
> > On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <[email protected]> wrote:
> >
> > > Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira
> > > (https://issues.apache.org/jira/browse/MAHOUT-705), and a test data
> > > set with ~300k rows like:
> > >
> > > "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
> > >
> > > It's slowly doing the translations:
> > > INFO: [+++][MONGO-MAP] Adding Translation Item ID: 4d57d54434ac9fd3570005a2 long_value: 145367
> > >
> > > It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> > > 8G ram, 4 virtual cores
> > >
> > > With a test data set of 3M preferences, that would take >5 days, just
> > > for the translation.
> > >
> > > Open to ideas/suggestions/"a-ha"-moments. Thanks!
> > >
> > > On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > It makes the internals much cleaner to not repeat this conversion.
> > > >
> > > > But how is it that this is taking a long time? String -> lookup
> > > > should not be much longer than an array access, especially if you
> > > > use the Mahout collections or one of the dictionary types.
> > > >
> > > > On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <[email protected]> wrote:
> > > >
> > > > > Rather, how can I use string-based userid/itemid's without having
> > > > > to deal with the slowness associated with mapping them to a long?
> > > > >
> > > > > In the MongoDataModel, for example, significant time/overhead goes
> > > > > into converting the unique id's to long... I'm still getting my
> > > > > head wrapped around mahout, but this seems like a significant
> > > > > limitation. I have to assume there's some logic behind the decision
> > > > > to restrict them to long, but I didn't find anything about it in
> > > > > Mahout in Action or the list.
> > > > >
> > > > > Thanks.
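
For comparison, here is a minimal sketch of the kind of in-memory string-to-long lookup Ted describes, where each translation is a hash lookup rather than a database round trip. It is not taken from the MAHOUT-705 patch or from Mahout itself: the class name StringIdIndex and its methods are hypothetical, the sequential numbering of IDs is an assumption, and it only applies if the set of distinct IDs fits in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a Mahout or MongoDataModel class): maps string IDs
 * to sequential long values and back, entirely in memory.
 */
final class StringIdIndex {
    // String ID -> assigned long value.
    private final Map<String, Long> longById = new HashMap<String, Long>();
    // Assigned long value (used as the list position) -> original string ID.
    private final List<String> idByLong = new ArrayList<String>();

    /** Returns the long for this ID, assigning the next free value on first sight. */
    long toLong(String id) {
        Long existing = longById.get(id);
        if (existing != null) {
            return existing.longValue();
        }
        long assigned = idByLong.size();
        longById.put(id, Long.valueOf(assigned));
        idByLong.add(id);
        return assigned;
    }

    /** Reverse lookup: the string ID that was assigned the given long value. */
    String toStringId(long value) {
        if (value < 0 || value >= idByLong.size()) {
            throw new IllegalArgumentException("Unknown long value: " + value);
        }
        return idByLong.get((int) value);
    }

    /** Number of distinct IDs seen so far. */
    int size() {
        return idByLong.size();
    }
}
```

Calling toLong("4cec0a2934ac9fbd2b040000") on a fresh index assigns a value to that ID, and later calls with the same string return the same value. The mapping is lost when the process exits, so it would have to be rebuilt or persisted separately if the long values need to stay stable across runs.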
