I'll refresh my copy of the trunk and look into it. If this happens a lot, I'll
put my version of Mahout on GitHub until things settle down.

I had to copy the code for a couple of Mahout classes like Recommender and
ToItemVectorsReducer to get access to private statics; there are no substantive
changes. I haven't had time to compare them to the latest code in the Mahout
trunk, but I will, or you can. If we get any of this into the trunk we can get
those members made public and remove the need to copy the code.

As to downsampling, I have a bunch of thoughts on this.
1) I think the downsampling is happening properly in the code you have. There
is a limit passed to the PreparePreferenceMatrixJob *and* the new
PrepareActionMatricesJob. Both use the truncation limit, but the driver job
doesn't pass it in yet. You can change that if needed; it's not a first
priority for me, but let me know if you need it right away.
2) This is not the ideal way to downsample, if I understand the code. It keeps
the first items ingested, which has nothing to do with their timestamps. You'd
ideally truncate based on the order in which the user took the actions, keeping
the newest (there's a sketch of this after the list).
3) I am ingesting log file data in the order it is seen in the files and
filesystem. If multiple files are found, there is no guarantee that this order
has anything to do with the order in which users acted. In the mined data I
have from a movie site, the order is completely arbitrary; it may even be
partially alphabetical by item name.
4) While I understand the need to truncate the history vector used for the
query (h_p), in my experience the training data is not so obviously good to
truncate. The result may be different for other types of data (Ted mentions
contrary results with music preferences), but we ran tests measuring MAP (mean
average precision) on e-commerce big-box store data with lots of categories,
with the training data truncated at 3, 6, 9, and 12 months, and found the best
MAP at 12 months, though with diminishing improvement. However, truncating the
history used as the query may make sense, to capture the effects of recent
actions rather than ancient history. Mahout conflates the two histories and
truncates them both. Sebastian recently added downsampling to RowSimilarityJob,
so it will get done on [B'B] but not on [B'A] or the user's history as
ingested. If he took it out of PreparePreferenceMatrixJob, it will not be done
on ingested history at all. Best to ask him.
5) With Solr, TF-IDF weighting is used (I believe), so items that are very
popular will be down-weighted as a side effect rather than downsampled. This is
probably a good thing (a one-line IDF sketch follows the list).
6) The Solr recommender will downsample the training data using Mahout's
preferences-per-user limit. It can truncate the query vector separately, since
[B'B] comes from Mahout and the query may be constructed at runtime from very
recent user activity or taken from the Mahout history. The equation
R_a1 = [B'B]H_a1 could be written as [B'B]B' for the Mahout recommender, since
the same history vectors are used in training and query (unless Sebastian has
changed this). More generally we write [B'B]H_a2 to allow for
truncating/downsampling H_a2 (user history in columns) differently from B
(user history in rows), which may not need to be downsampled much, if at all.
A toy version of this scoring step is sketched below.
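
Here is a minimal sketch of the recency-based truncation I mean in (2),
assuming each action carries a timestamp. The names (TimestampedPref,
RecencyDownsampler, downsample) are invented for illustration; this is not
Mahout API, just the idea.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class TimestampedPref {
  final long itemId;
  final long timestamp; // when the user took the action, epoch millis
  TimestampedPref(long itemId, long timestamp) {
    this.itemId = itemId;
    this.timestamp = timestamp;
  }
}

final class RecencyDownsampler {
  /** Keep at most maxPrefsPerUser actions, preferring the newest ones. */
  static List<TimestampedPref> downsample(List<TimestampedPref> prefs,
                                          int maxPrefsPerUser) {
    List<TimestampedPref> sorted = new ArrayList<TimestampedPref>(prefs);
    Collections.sort(sorted, new Comparator<TimestampedPref>() {
      public int compare(TimestampedPref a, TimestampedPref b) {
        // newest first, so truncation drops the oldest actions
        return Long.compare(b.timestamp, a.timestamp);
      }
    });
    return sorted.subList(0, Math.min(maxPrefsPerUser, sorted.size()));
  }
}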
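
On (5), the down-weighting falls out of the IDF term, which shrinks as more
users act on an item. This uses the textbook idf = log(N / df); Lucene's actual
similarity formula differs a bit, so take it as the idea only.

final class IdfWeight {
  // Items that appear in many user histories get a small weight.
  static double idf(int numUsers, int numUsersWhoActedOnItem) {
    return Math.log((double) numUsers / numUsersWhoActedOnItem);
  }
}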
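
And a toy version of the scoring step in (6), r = [B'B]h: multiply the
item-item cooccurrence matrix by one user's (possibly truncated) query history
vector. Dense arrays for readability; in practice these are sparse Mahout
structures.

final class CooccurrenceScorer {
  /** cooccurrence is [B'B] (items x items); history is h (length items). */
  static double[] score(double[][] cooccurrence, double[] history) {
    double[] r = new double[cooccurrence.length];
    for (int i = 0; i < cooccurrence.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < history.length; j++) {
        sum += cooccurrence[i][j] * history[j]; // only nonzero entries matter
      }
      r[i] = sum; // score for item i
    }
    return r; // rank by r[i], typically excluding items already in the history
  }
}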

A little off subject, but a final thought on the use of recommenders by the
experts (Amazon, Netflix): they generally do not train on all categories of
items in one model (as we did in the example above). They segment or cluster
their items, then train and recommend using a derived category as the context.
That is why you see rows on their home pages relating to different categories
of things you recently bought or watched. The use of context of some type is a
very important thing, IMO.


On Aug 3, 2013, at 9:48 PM, B Lyon <[email protected]> wrote:

Hi Pat

I was going to just play with building the solr-recommender stuff in its
current WIP state and noticed a compile error (running mvn install), I think
because the 0.9 snapshot has some changes from July 30th:

http://svn.apache.org/viewvc?view=revision&revision=1508302

Basically, back on June 18, Ted noticed that the downsampling might not be
done in the right place to actually avoid overwork due to "perversely prolific
users" (thread is here:
http://web.archiveorange.com/archive/v/z6zxQatCzHoFxbdLF0of), and someone else
(Sebastian Schelter) has already acted on this (July 30) to move the
downsampling elsewhere (MAHOUT-1289 -
https://issues.apache.org/jira/browse/MAHOUT-1289), which (among other things)
removes the SAMPLE_SIZE static variable from ToItemVectorsMapper. I don't know
how the general changes affect what you were setting up/playing with. Let me
know if I've missed something here.



-- 
BF Lyon
http://www.nowherenearithaca.com
