+1 on vector properties

On Aug 4, 2013, at 5:34 PM, Pat Ferrel <[email protected]> wrote:

> It does bring up a nice way to order the items in the A and B docs: by 
> timestamp, if available. That way when you get an h_b doc from B for the query:
> 
> recommend based on behavior with regard to B items and actions h_b
>      query is [b-b-links: h_b]
> 
> the h_b items are ordered by recency. You can truncate based on the number of 
> actions you want to consider (something like the sketch below). This should be 
> very easy to implement if only we could attach data to the items in the DRMs.
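> 
> A rough sketch of that truncation, assuming the history is just a list of 
> (itemId, timestamp) pairs; the "b-b-links" field name and the max-actions 
> cutoff are made up for illustration:
> 
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> 
> class HistoryTruncation {
> 
>   static class Action {
>     final String itemId;
>     final long timestamp;
>     Action(String itemId, long timestamp) {
>       this.itemId = itemId;
>       this.timestamp = timestamp;
>     }
>   }
> 
>   // keep only the newest maxActions items and join them into a field query
>   static String recentActionsQuery(List<Action> hB, int maxActions) {
>     List<Action> sorted = new ArrayList<Action>(hB);
>     Collections.sort(sorted, new Comparator<Action>() {
>       public int compare(Action a, Action b) {
>         return Long.compare(b.timestamp, a.timestamp);   // newest first
>       }
>     });
>     StringBuilder query = new StringBuilder("b-b-links:(");
>     int n = Math.min(maxActions, sorted.size());
>     for (int i = 0; i < n; i++) {
>       if (i > 0) {
>         query.append(' ');
>       }
>       query.append(sorted.get(i).itemId);
>     }
>     return query.append(')').toString();
>   }
> }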
> 
> Actually this brings up another point that I've harped on before. It sure 
> would be nice to have a vector representation where you could attach 
> arbitrary data to items or vectors. Not so memory efficient, but it makes 
> things like ID translation and timestamping actions trivial. If this data 
> could be attached and survive all the Mahout jobs, there would be no need for 
> the in-memory hashmap I'm using to translate IDs, and the actions could be 
> timestamped or other metadata could be attached. At present, I guess everyone 
> knows that only weights are attached to actions/matrix values and, in some 
> cases, names to rows/vectors in DRMs. 
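> 
> To make that concrete, here is a rough sketch of the kind of wrapper I have 
> in mind -- nothing like this exists in Mahout today, it just pairs a Vector 
> with per-element metadata the way NamedVector pairs a Vector with a name:
> 
> import java.util.HashMap;
> import java.util.Map;
> 
> import org.apache.mahout.math.Vector;
> 
> // Purely hypothetical: a Vector plus arbitrary per-element metadata
> // (e.g. external item IDs or action timestamps keyed by element index).
> class MetadataVector {
> 
>   private final Vector vector;
>   private final Map<Integer, Map<String, Object>> metadata =
>       new HashMap<Integer, Map<String, Object>>();
> 
>   MetadataVector(Vector vector) {
>     this.vector = vector;
>   }
> 
>   Vector vector() {
>     return vector;
>   }
> 
>   // attach an arbitrary key/value pair to one element of the vector
>   void attach(int index, String key, Object value) {
>     Map<String, Object> m = metadata.get(index);
>     if (m == null) {
>       m = new HashMap<String, Object>();
>       metadata.put(index, m);
>     }
>     m.put(key, value);
>   }
> 
>   Object get(int index, String key) {
>     Map<String, Object> m = metadata.get(index);
>     return m == null ? null : m.get(key);
>   }
> }
> 
> // usage: keep the external item ID and action timestamp next to the weight
> //   MetadataVector row = new MetadataVector(new RandomAccessSparseVector(numItems));
> //   row.vector().set(mahoutId, 1.0);
> //   row.attach(mahoutId, "externalId", "SKU-1234");
> //   row.attach(mahoutId, "timestamp", 1375653240000L);
> 
> Making something like this survive whole job pipelines would of course also 
> need a Writable for it, which is the hard part.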
> 
> 
> On Aug 4, 2013, at 12:59 PM, Ted Dunning <[email protected]> wrote:
> 
> On Sun, Aug 4, 2013 at 9:35 AM, Pat Ferrel <[email protected]> wrote:
> 
>> 2) This is not an ideal way to downsample, if I understand the code. It keeps
>> the first items ingested, which has nothing to do with their timestamps.
>> You'd ideally want to truncate based on the order in which the actions were
>> taken by the user, keeping the newest.
> 
> 
> 
> There are at least three options for down-sampling.  All have arguments in
> their favor and probably have good applications.  I don't think it actually
> matters, however, since down-sampling should mostly be applied to
> pathological cases like bots or QA teams.
> 
> The options that I know of include:
> 
> 1) take the first events you see.  This is easy.  For content, it may be
> best to do this because it gives you information about the context of the
> content when it first appears.  For users, this may be the worst
> characterization of the user now, but it may be near best for the off-line
> item-item analysis because it preserves a densely sampled view of some past
> moment in time.
> 
> 2) take the last events you see.  This is also easy, but not quite as easy
> as (1) since you can't stop early if you see the data in chronological
> order.  For content, this gives you the latest view of the content and
> pushes the data for all items into the same time frame, which might increase
> overlap in the offline analysis.  For users at recommendation time, it is
> probably exactly what you want.
> 
> 3) take some time-weighted sampling that is in between these two options.
> You can do reservoir sampling to get a fair sample, or you can do random
> replacement, which weights the recent past more heavily than the far past
> (a rough sketch of both follows this list).  Both of these are attractive
> for various reasons.  The strongest argument for recency-weighted sampling
> is probably that it is hard to decide between (1) and (2).
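> 
> To be concrete, here is roughly what I mean by the two variants of (3): plain
> reservoir sampling for a fair sample, and a fixed-probability random
> replacement that decays older events geometrically.  The reservoir size k and
> replacement probability p are placeholders:
> 
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Random;
> 
> class EventDownSampling {
> 
>   // classic reservoir sampling: every event in the stream ends up in the
>   // sample with equal probability k / n
>   static <T> List<T> fairSample(Iterable<T> events, int k, Random rng) {
>     List<T> reservoir = new ArrayList<T>(k);
>     int i = 0;
>     for (T event : events) {
>       if (reservoir.size() < k) {
>         reservoir.add(event);
>       } else {
>         int j = rng.nextInt(i + 1);        // uniform in [0, i]
>         if (j < k) {
>           reservoir.set(j, event);         // keep with probability k / (i + 1)
>         }
>       }
>       i++;
>     }
>     return reservoir;
>   }
> 
>   // random replacement: once the reservoir is full, each new event overwrites
>   // a random slot with probability p, so older events decay geometrically
>   static <T> List<T> recencyWeightedSample(Iterable<T> events, int k, double p,
>                                            Random rng) {
>     List<T> reservoir = new ArrayList<T>(k);
>     for (T event : events) {
>       if (reservoir.size() < k) {
>         reservoir.add(event);
>       } else if (rng.nextDouble() < p) {
>         reservoir.set(rng.nextInt(k), event);
>       }
>     }
>     return reservoir;
>   }
> }
> 
> How strongly the recent past is favored is then just a question of how large
> p is relative to k.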
> 
> As stated above, however, this probably doesn't much matter since the
> sampling being done in the off-line analysis is mostly only applied to
> crazy users or stuff so popular that any sample will do.
> 
