The input data does NOT have to be in a particular order. --sebastian
On 05.06.2012 01:31, Something Something wrote:
> So data has to be in "Order By UserId, ItemID, Preference"? Hmm.. for a
> file containing billion rows this may take some time, but if that's what
> it wants that's what I will provide. Please confirm. Thanks.
>
> On Mon, Jun 4, 2012 at 4:20 PM, Lance Norskog <[email protected]> wrote:
>
>> It needs a complete "ordering", meaning code that takes any two values
>> and says "this one before that one". This lets Hadoop do global
>> sorting. If they're strings you would sort on the strings.
>>
>> On Mon, Jun 4, 2012 at 4:00 PM, Something Something
>> <[email protected]> wrote:
>>> Fair enough. Just one more question:
>>>
>>> 1) >>it just needs to have an ordering
>>> The input data doesn't need to be in any particular sequence, correct?
>>> Not sure what you mean by 'needs to have an ordering'.
>>>
>>> On Mon, Jun 4, 2012 at 3:29 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> That's how it used to work but it was restricted to integers a long
>>>> time ago purely for speed and memory. It makes a big difference.
>>>> Many (most?) use cases have some numeric ID for these guys already.
>>>> Otherwise no reason it needs to be an integer it just needs to have
>>>> an ordering.
>>>>
>>>> You can retain the mapping how you like. All you really need are the
>>>> original ID values to recreate the mapping as it is just based on
>>>> MD5. So a file is sufficient for example. But to do the mapping on
>>>> the fly it has to be in memory yes or else it is too slow.
>>>>
>>>> Best is to find a numeric ID to use in your model if you can.
>>>>
>>>> Myrrix works this way too, if desired, but almost as a feature as
>>>> the 'real' IDs need never be sent into the hosted recommender in the
>>>> cloud, just a hashed numeric ID. That's nice from a security or
>>>> privacy standpoint.
>>>>
>>>> On Jun 4, 2012 11:05 PM, "Something Something" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hmm.. that's a bit weird. Looking at the algorithm, I don't
>>>>> understand why UserID has to be Long. It's just an Identifier of a
>>>>> row, isn't it? The algorithm really only works with Item IDs and
>>>>> even with ItemIDs I would argue they don't need to be Numeric. Am I
>>>>> missing something?
>>>>>
>>>>> We have over billion user ids. So for each ID I need to create a
>>>>> corresponding 'long' value in Memory? Is that what this class is
>>>>> doing?
>>>>>
>>>>> On Mon, Jun 4, 2012 at 2:50 PM, Manuel Blechschmidt
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Something,
>>>>>> actually this is correct.
>>>>>>
>>>>>> You can use the MemoryIDMigrator
>>>>>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/model/MemoryIDMigrator.html
>>>>>> to create Longs from your strings.
>>>>>>
>>>>>> /Manuel
>>>>>>
>>>>>> On 04.06.2012, at 23:47, Something Something wrote:
>>>>>>
>>>>>>> Trying to use this class. Noticed that 'UserID' must be Long.
>>>>>>> That doesn't sound right. Isn't there a way to tell this class
>>>>>>> that the 'UserID' is String? Please let me know. Thanks.
>>>>>>
>>>>>> --
>>>>>> Manuel Blechschmidt
>>>>>> M.Sc. IT Systems Engineering
>>>>>> Dortustr. 57
>>>>>> 14467 Potsdam
>>>>>> Mobil: 0173/6322621
>>>>>> Twitter: http://twitter.com/Manuel_B
>>
>> --
>> Lance Norskog
>> [email protected]
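
For anyone reading this thread later, here is a rough sketch of the idea discussed above: derive a long ID from a string ID with MD5, and keep the reverse mapping in memory so the original strings can be looked up again. This is only an illustration under stated assumptions. StringIdMapper and its methods are hypothetical names, not Mahout classes, and taking the first 8 bytes of the digest is my assumption about how to get 64 bits out of MD5. It is not meant as a description of what MemoryIDMigrator actually does internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; this is NOT a Mahout class.
class StringIdMapper {

    // Reverse mapping (long -> original string), held in memory.
    private final Map<Long, String> longToString = new HashMap<>();

    // Assumption: build the long from the first 8 bytes of the MD5 digest.
    static long toLongId(String id) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                // Shift in one byte at a time, treating each byte as unsigned.
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Hash the string ID and remember it so it can be recovered later.
    long store(String id) {
        long longId = toLongId(id);
        longToString.put(longId, id);
        return longId;
    }

    // Returns the original string, or null if this long ID was never stored.
    String lookup(long longId) {
        return longToString.get(longId);
    }

    public static void main(String[] args) {
        StringIdMapper mapper = new StringIdMapper();
        long longId = mapper.store("user-abc");
        System.out.println(longId + " -> " + mapper.lookup(longId));
    }
}
```

Because the long is a pure function of the string, the mapping can be rebuilt at any time from a file of the original IDs, as Sean points out. The in-memory map only matters when you need to translate back on the fly, and with a billion user IDs it would need a lot of heap.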
