+1 on vector properties

On Aug 4, 2013, at 5:34 PM, Pat Ferrel <[email protected]> wrote:
> It does bring up a nice way to order the items in the A and B docs, by
> timestamp if available. That way when you get an h_b doc from B for the query:
>
> recommend based on behavior with regard to B items and actions h_b
> query is [b-b-links: h_b]
>
> the h_b items are ordered by recency. You can truncate based on the number of
> actions you want to consider. This should be very easy to implement if only
> we could attach data to the items in the DRMs.
>
> Actually this brings up another point that I've harped on before. It sure
> would be nice to have a vector representation where you could attach
> arbitrary data to items or vectors. Not as memory efficient, but it makes
> things like ID translation and timestamping actions trivial. If these could
> be attached and survive all the Mahout jobs, there would be no need for the
> in-memory hashmap I'm using to translate IDs, and the actions could be
> timestamped or other metadata could be attached. At present I guess everyone
> knows that only weights are attached to actions/matrix values and, in some
> cases, names to rows/vectors in DRMs.
>
> On Aug 4, 2013, at 12:59 PM, Ted Dunning <[email protected]> wrote:
>
>> On Sun, Aug 4, 2013 at 9:35 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> 2) This is not an ideal way to downsample if I understand the code. It
>>> keeps the first items ingested, which has nothing to do with their
>>> timestamp. You'd ideally want to truncate based on the order the actions
>>> were taken by the user, keeping the newest.
>>
>> There are at least three options for down-sampling. All have arguments in
>> their favor and probably have good applications. I don't think it actually
>> matters, however, since down-sampling should mostly be applied to
>> pathological cases like bots or QA teams.
>>
>> The options that I know of include:
>>
>> 1) Take the first events you see. This is easy. For content, it may be
>> best to do this because it gives you information about the context of the
>> content when it first appears. For users, this may be the worst
>> characterization of the user now, but it may be near best for the off-line
>> item-item analysis because it preserves a densely sampled view of some past
>> moment in time.
>>
>> 2) Take the last events you see. This is also easy, but not quite as easy
>> as (1) since you can't stop early if you see the data in chronological
>> order. For content, this gives you the latest view of the content and
>> pushes all data for all items into the same time frame, which might
>> increase overlap in the offline analysis. For users at recommendation
>> time, it is probably exactly what you want.
>>
>> 3) Take some time-weighted sampling that is in between these two options.
>> You can do reservoir sampling to get a fair sample, or you can do random
>> replacement, which weights the recent past more heavily than the far past.
>> Both of these are attractive for various reasons. The strongest argument
>> for recency-weighted sampling is probably that it is hard to decide
>> between (1) and (2).
>>
>> As stated above, however, this probably doesn't much matter, since the
>> sampling being done in the off-line analysis is mostly only applied to
>> crazy users or stuff so popular that any sample will do.
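
To make the vector-properties idea concrete, here is a minimal sketch of what a
property-carrying vector could look like, assuming Mahout's existing Vector,
NamedVector, and RandomAccessSparseVector classes from org.apache.mahout.math.
PropertyVector and its metadata map are hypothetical, not an existing Mahout
class, and a real version would have to implement Vector and Writable so the
properties actually survive serialization through the Mahout jobs.

import java.util.HashMap;
import java.util.Map;

import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/**
 * Hypothetical sketch: a vector that carries arbitrary key/value properties
 * (external ID, per-action timestamps, etc.) alongside the numeric data.
 * Composition only -- a production version would implement Vector and
 * Writable so the properties survive the map-reduce passes.
 */
public class PropertyVector {

  private final Vector delegate;
  private final Map<String, Object> properties = new HashMap<String, Object>();

  public PropertyVector(Vector delegate) {
    this.delegate = delegate;
  }

  public Vector vector() {
    return delegate;
  }

  public void setProperty(String key, Object value) {
    properties.put(key, value);
  }

  public Object getProperty(String key) {
    return properties.get(key);
  }

  public static void main(String[] args) {
    // one user's row of A: a sparse item vector plus attached metadata
    Vector row = new NamedVector(new RandomAccessSparseVector(100000), "u123");
    PropertyVector userRow = new PropertyVector(row);

    userRow.vector().set(42, 1.0);                       // action on item 42
    userRow.setProperty("externalId", "user-8f3a");      // no hashmap lookup needed
    userRow.setProperty("timestamp:42", 1375641600000L); // when the action happened

    System.out.println(userRow.getProperty("externalId"));
  }
}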

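On Ted's option (3), here is a rough sketch of the two variants he mentions:
plain reservoir sampling, which gives a fair sample over all of a user's
events, and always-replace sampling, where every new event evicts a random
slot so older events decay geometrically and recency is weighted. EventSampler
and its method names are just for illustration, not anything that exists in
Mahout.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Keeps at most `limit` events per user.
 * uniform = true  -> classic reservoir sampling (Algorithm R): every event
 *                    seen so far is equally likely to be in the sample.
 * uniform = false -> always keep the new event and evict a random slot, so
 *                    the survival probability of an old event decays
 *                    geometrically and recent events dominate.
 */
public class EventSampler {

  private final int limit;
  private final boolean uniform;
  private final Random random = new Random();
  private final List<Long> sample = new ArrayList<Long>();
  private int seen = 0;

  public EventSampler(int limit, boolean uniform) {
    this.limit = limit;
    this.uniform = uniform;
  }

  public void offer(long itemId) {
    seen++;
    if (sample.size() < limit) {
      sample.add(itemId);
    } else if (uniform) {
      // keep the new event with probability limit / seen
      int slot = random.nextInt(seen);
      if (slot < limit) {
        sample.set(slot, itemId);
      }
    } else {
      // always keep the new event, evicting a uniformly random victim
      sample.set(random.nextInt(limit), itemId);
    }
  }

  public List<Long> sample() {
    return sample;
  }
}

The always-replace variant is the easy middle ground Ted describes when it is
hard to choose between (1) and (2): it still scans the whole history but the
retained sample leans toward the recent past.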