Re: ItemSimilarityJob

Sebastian Schelter Mon, 04 Jun 2012 20:04:09 -0700

The input data does NOT have to be in a particular order.

--sebastian


On 05.06.2012 01:31, Something Something wrote:
> So data has to be in "Order By UserId, ItemID, Preference"?  Hmm.. for a
> file containing a billion rows this may take some time, but if that's what
> it wants, that's what I will provide.  Please confirm.  Thanks.
> 
> On Mon, Jun 4, 2012 at 4:20 PM, Lance Norskog <[email protected]> wrote:
> 
>> It needs a complete "ordering", meaning code that takes any two values
>> and says "this one before that one". This lets Hadoop do global
>> sorting. If they're strings you would sort on the strings.
>>
>> On Mon, Jun 4, 2012 at 4:00 PM, Something Something
>> <[email protected]> wrote:
>>> Fair enough.  Just one more question:
>>>
>>> 1)  >>it just needs to have an ordering
>>> The input data doesn't need to be in any particular sequence, correct?
>>> Not sure what you mean by 'needs to have an ordering'.
>>>
>>>
>>> On Mon, Jun 4, 2012 at 3:29 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> That's how it used to work, but it was restricted to integers a long time
>>>> ago purely for speed and memory. It makes a big difference. Many (most?)
>>>> use cases have some numeric ID for these guys already.  Otherwise there is
>>>> no reason it needs to be an integer; it just needs to have an ordering.
>>>>
>>>> You can retain the mapping how you like. All you really need are the
>>>> original ID values to recreate the mapping, as it is just based on MD5.
>>>> So a file is sufficient, for example. But to do the mapping on the fly it
>>>> has to be in memory, yes, or else it is too slow.
>>>>
>>>> Best is to find a numeric ID to use in your model if you can.
>>>>
>>>> Myrrix works this way too, if desired, but almost as a feature as the
>>>> 'real' IDs need never be sent into the hosted recommender in the cloud,
>>>> just a hashed numeric ID. That's nice from a security or privacy
>>>> standpoint.
>>>> On Jun 4, 2012 11:05 PM, "Something Something" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hmm.. that's a bit weird.  Looking at the algorithm, I don't understand
>>>>> why UserID has to be Long.  It's just an identifier of a row, isn't it?
>>>>> The algorithm really only works with Item IDs, and even with ItemIDs I
>>>>> would argue they don't need to be numeric.  Am I missing something?
>>>>>
>>>>> We have over a billion user IDs.  So for each ID I need to create a
>>>>> corresponding 'long' value in memory?  Is that what this class is doing?
>>>>>
>>>>> On Mon, Jun 4, 2012 at 2:50 PM, Manuel Blechschmidt <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Something,
>>>>>> actually this is correct.
>>>>>>
>>>>>> You can use the MemoryIDMigrator
>>>>>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/model/MemoryIDMigrator.html
>>>>>> to create Longs from your strings.
>>>>>>
>>>>>> /Manuel
>>>>>>
>>>>>> On 04.06.2012, at 23:47, Something Something wrote:
>>>>>>
>>>>>>> Trying to use this class.  Noticed that 'UserID' must be Long.  That
>>>>>>> doesn't sound right.  Isn't there a way to tell this class that the
>>>>>>> 'UserID' is String?  Please let me know.  Thanks.
>>>>>>
>>>>>> --
>>>>>> Manuel Blechschmidt
>>>>>> M.Sc. IT Systems Engineering
>>>>>> Dortustr. 57
>>>>>> 14467 Potsdam
>>>>>> Mobil: 0173/6322621
>>>>>> Twitter: http://twitter.com/Manuel_B
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>
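
For readers following the ID discussion quoted above: the thread says the string-to-long mapping is "just based on MD5" and can be recreated from the original ID values. A minimal sketch of that idea follows. It is not Mahout's code; the class name StringIdMapper and its methods are hypothetical, and taking the first 8 bytes of the digest in big-endian order is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): maps string IDs to longs via MD5. */
public class StringIdMapper {

    // Reverse mapping (long -> original string), held in memory for fast lookup.
    private final Map<Long, String> longToString = new HashMap<>();

    /**
     * Hashes a string ID to a long. Assumption for this sketch: the long is
     * built from the first 8 bytes of the MD5 digest, most significant first.
     */
    public static long toLongId(String stringId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(stringId.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Computes the long ID and remembers the original string for reverse lookup. */
    public long store(String stringId) {
        long id = toLongId(stringId);
        longToString.put(id, stringId);
        return id;
    }

    /** Returns the original string for a stored long ID, or null if unknown. */
    public String toStringId(long longId) {
        return longToString.get(longId);
    }

    public static void main(String[] args) {
        StringIdMapper mapper = new StringIdMapper();
        long id = mapper.store("user-abc");
        System.out.println(id + " -> " + mapper.toStringId(id));
    }
}
```

Because the forward direction is a pure function of the string, only the reverse lookup needs storage; with a billion IDs that table could live in a file and be rebuilt from the original ID values instead of being held in memory.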
