Hi Bernat,

you can do the offline similarity calculation on a single machine with
o.a.m.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities

or on a Hadoop cluster (if necessary) with
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

I think such a setup is much easier than coming up with complicated
caching logic.
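
To make the idea concrete, here is a minimal plain-Java sketch of the
workflow: an offline step that computes the top-N similar items per item and
writes them to a file, and an online step that only loads that file. It does
not use the Mahout API. The class PrecomputedSimilarities, its methods, the
CSV layout, and the choice of cosine similarity are all assumptions made for
illustration. In Mahout the two classes named above would take the place of
the offline step.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical names, not part of Mahout. */
class PrecomputedSimilarities {

    /** One precomputed entry: a neighboring item and its similarity. */
    static final class Neighbor {
        final long itemId;
        final double similarity;

        Neighbor(long itemId, double similarity) {
            this.itemId = itemId;
            this.similarity = similarity;
        }
    }

    /** Cosine similarity between two items, each given as userId -> rating. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Offline step: write the topN most similar items of every item as CSV lines. */
    static void precompute(Map<Long, Map<Long, Double>> ratingsByItem,
                           int topN, Path out) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out)) {
            for (long i : ratingsByItem.keySet()) {
                List<Neighbor> neighbors = new ArrayList<>();
                for (long j : ratingsByItem.keySet()) {
                    if (i == j) {
                        continue;
                    }
                    double s = cosine(ratingsByItem.get(i), ratingsByItem.get(j));
                    if (s > 0) {
                        neighbors.add(new Neighbor(j, s));
                    }
                }
                // Highest similarity first, then keep only the first topN.
                neighbors.sort(Comparator.comparingDouble((Neighbor n) -> n.similarity).reversed());
                for (Neighbor n : neighbors.subList(0, Math.min(topN, neighbors.size()))) {
                    writer.write(i + "," + n.itemId + "," + n.similarity);
                    writer.newLine();
                }
            }
        }
    }

    /** Online step: load the precomputed file; no similarity is computed here. */
    static Map<Long, List<Neighbor>> load(Path in) throws IOException {
        Map<Long, List<Neighbor>> result = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                result.computeIfAbsent(Long.parseLong(parts[0]), k -> new ArrayList<>())
                      .add(new Neighbor(Long.parseLong(parts[1]), Double.parseDouble(parts[2])));
            }
        }
        return result;
    }
}
```

The live system would call load() once when it swaps in a new model, so no
request has to pay for a similarity computation. The nested loop in
precompute() is quadratic in the number of items, which is the reason to run
it in parallel or on a cluster.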

Best,
Sebastian

On 23.04.2013 15:09, Gabor Bernat wrote:
> Well,
> 
> assume that we have a live system. The system is serving requests 24/7, and
> in order to update the model periodically we recreate it on another machine
> and just save it (e.g. to a database). To update, we just load the saved
> data model and swap it with the one that was running before. The
> similarities could also be calculated and saved on this other machine, and
> then simply loaded on the live system.
> 
> The advantage of this is that you can offload the similarity calculation to
> an offline system without paying the cold-start price on the live/online
> system. Loading the precalculated, most-used similarities should certainly
> be less costly than doing all this on the live system, even if concurrently.
> The problem with calculating them on the live system is that for the first
> 30-40 minutes you'll have long response times compared to the state when
> your cache is filled.
> 
> 
> Bernát GÁBOR
> 
> 
> On Tue, Apr 23, 2013 at 2:54 PM, Sean Owen <[email protected]> wrote:
> 
>> I agree, but how is "pre-adding a cached value for X" different than
>> "requesting X from the cache"? Either way you get X in the cache.
>> Computing offline seems the same as computing on-line, but in some
>> kind of warm-up state or phase. Which can be concurrent with serving
>> early requests even. You can do everything else you say without a new
>> operation, like selectively pre-caching certain entries.
>>
>> On Tue, Apr 23, 2013 at 1:14 PM, Gabor Bernat <[email protected]>
>> wrote:
>>> CachingSimilarity also allowed adding entries manually, because in that
>>> case this task could be pushed off to an offline system. And yes, you
>>> cannot add all the similarities to the caching object, however based on
>>> history you can select some top (popular) item pairs, and just calculate
>>> for that subset. This could push down the upper request times. Any other
>>> ideas?
>>>
>>
> 
