UserID and ItemID are usually domain-level keys, not generated by the DB. With some of the movie databases, you get tables of "user/item/pref/time", "item/moviename/genre", and maybe "user/geocode".
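
If the string keys do have to be turned into longs inside the application, an in-memory dictionary is one way to keep the lookup cheap. A minimal sketch, assuming the whole key set fits in RAM; the class name StringIdDictionary and its methods are made up for illustration and are not part of Mahout or of the MongoDataModel patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Mahout class): assigns dense long IDs to
// string keys in first-seen order and remembers the reverse mapping.
class StringIdDictionary {
    // string key -> assigned long ID
    private final Map<String, Long> ids = new HashMap<>();
    // assigned long ID (used as list index) -> string key
    private final List<String> keys = new ArrayList<>();

    // Returns the existing ID for key, or assigns the next free one.
    long idFor(String key) {
        Long existing = ids.get(key);
        if (existing != null) {
            return existing;
        }
        long id = keys.size();
        ids.put(key, id);
        keys.add(key);
        return id;
    }

    // Reverse lookup: the original string for a previously assigned ID.
    String keyFor(long id) {
        return keys.get((int) id);
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        StringIdDictionary items = new StringIdDictionary();
        long first = items.idFor("4cec0a2934ac9fbd2b040000");
        long second = items.idFor("4d065d5434ac9f5227a12f00");
        long again = items.idFor("4cec0a2934ac9fbd2b040000");
        System.out.println(first + " " + second + " " + again);
        System.out.println(items.keyFor(second));
    }
}
```

Each lookup is a single hash probe with no round trip to the database, so the cost per row should be far below the 8.3/sec quoted below. The obvious trade-off is that the mapping lives only in memory and has to be rebuilt or persisted separately.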

Lance

On Tue, May 31, 2011 at 9:51 PM, Mike Khristo <[email protected]> wrote:
> Using the 0.6 snapshot + patch 705 (mongodatamodel) from jira
> (https://issues.apache.org/jira/browse/MAHOUT-705), and a test data set
> with ~300k rows like:
>
> "4cec0a2934ac9fbd2b040000","4d065d5434ac9f5227a12f00",118
>
> It's slowly doing the translations:
> INFO: [+++][MONGO-MAP] Adding Translation Item ID:
> 4d57d54434ac9fd3570005a2 long_value: 145367
>
> It's doing about 30,000 per hour (and getting slower). That's 8.3/sec.
> 8G ram, 4 virtual cores
>
> With a test data set of 3M preferences, that would take >5 days, just for
> the translation.
>
> Open to ideas/suggestions/"a-ha"-moments. Thanks!
>
> On Tue, May 31, 2011 at 9:15 PM, Ted Dunning <[email protected]> wrote:
>
>> It makes the internals much cleaner to not repeat this conversion.
>>
>> But how is it that this is taking a long time? String -> lookup should
>> not be much longer than an array access, especially if you use the
>> Mahout collections or one of the dictionary types.
>>
>> On Tue, May 31, 2011 at 7:50 PM, Mike Khristo <[email protected]> wrote:
>>
>> > Rather, how can I use string-based userid/itemid's without having to
>> > deal with the slowness associated with mapping them to a long?
>> >
>> > In the MongoDataModel, for example, significant time/overhead goes into
>> > converting the unique id's to long... I'm still getting my head wrapped
>> > around mahout, but this seems like a significant limitation. I have to
>> > assume there's some logic behind the decision to restrict them to long,
>> > but I didn't find anything about it in Mahout in Action or the list.
>> >
>> > Thanks.

--
Lance Norskog
[email protected]
