Hi Bernat,

you can do the offline similarity calculation on a single machine with
o.a.m.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
or on a Hadoop cluster (if necessary) with
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

I think such a setup is much easier than coming up with complicated
caching logic.

Best,
Sebastian

On 23.04.2013 15:09, Gabor Bernat wrote:
> Well,
>
> assume that we have a live system. The system is serving requests 24/7, and
> in order to update the model periodically we recreate it on another machine
> and just save it (e.g. to a database), and to update we just load the saved
> data model and swap it with the one that was running before. The
> similarities could also be calculated and saved on this other machine, and
> then just loaded on the live machine.
>
> The advantage of this is that you can offload similarity calculation to an
> offline system without paying the cold start price on the live/online
> system. Loading the precalculated, most-used similarities should certainly
> be less costly than doing all this on the live system, even if concurrently.
> The problem with calculating them on the live system is that for the first
> 30-40 minutes you'll have long response times compared to the state when
> your cache is filled.
>
>
> Bernát GÁBOR
>
>
> On Tue, Apr 23, 2013 at 2:54 PM, Sean Owen <[email protected]> wrote:
>
>> I agree, but how is "pre-adding a cached value for X" different than
>> "requesting X from the cache"? Either way you get X in the cache.
>> Computing offline seems the same as computing on-line, but in some
>> kind of warm-up state or phase. Which can be concurrent with serving
>> early requests even. You can do everything else you say without a new
>> operation, like selectively pre-caching certain entries.
>>
>> On Tue, Apr 23, 2013 at 1:14 PM, Gabor Bernat <[email protected]>
>> wrote:
>>> CachingSimilarity also allowed to add manually entries, because in that
>>> case this task could be pushed off to an offline system. And yes, you
>>> cannot add all the similarities to the caching object, however based on
>>> history you can select some top (popular) item pairs, and just calculate
>>> for that subset. This could push down the upper request times. Any other
>>> ideas?
>>>
>>
>
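
For illustration only, the "precompute offline, load and swap on the live system" setup discussed in this thread could be sketched in plain Java roughly as follows. This does not use the Mahout classes named above. The class `PrecomputedSimilarityStore`, its methods, the string key format, and the use of cosine similarity over dense item vectors are all hypothetical choices made for this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: not Mahout API, all names are assumptions.
final class PrecomputedSimilarityStore {

    // The snapshot currently used to serve requests; starts empty.
    private final AtomicReference<Map<String, Double>> current =
            new AtomicReference<>(Map.of());

    // Order-independent key for an item pair, so (a, b) and (b, a) match.
    static String key(long a, long b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    // Cosine similarity of two equal-length vectors (0.0 if either is all zeros).
    static double cosine(double[] x, double[] y) {
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return (normX == 0.0 || normY == 0.0) ? 0.0 : dot / Math.sqrt(normX * normY);
    }

    // Offline step: compute similarities only for the selected (popular) pairs.
    static Map<String, Double> precompute(Map<Long, double[]> itemVectors,
                                          List<long[]> popularPairs) {
        Map<String, Double> result = new HashMap<>();
        for (long[] pair : popularPairs) {
            double value = cosine(itemVectors.get(pair[0]), itemVectors.get(pair[1]));
            result.put(key(pair[0], pair[1]), value);
        }
        return result;
    }

    // Live step: atomically replace the served snapshot with a freshly loaded one.
    void swapIn(Map<String, Double> fresh) {
        current.set(Map.copyOf(fresh));
    }

    // Serve from the snapshot; fall back to on-demand computation on a miss.
    double similarity(long a, long b, Map<Long, double[]> itemVectors) {
        Double cached = current.get().get(key(a, b));
        return cached != null ? cached : cosine(itemVectors.get(a), itemVectors.get(b));
    }
}
```

In this sketch the offline machine would call `precompute` and persist the resulting map. The live machine would load it and call `swapIn`, so requests for popular pairs never pay the computation cost, while less common pairs are still computed on demand.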
