Not following so…
Here is what I've done, in probably too much detail:
1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout type IDs, keeping a map
of IDs
3) run the Mahout Item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity
computed with matrix math
5) given the two similarity matrices and a user history matrix, write them to
csv files with the Mahout IDs replaced by the original string external IDs for
users and items
Input log file before splitting:
u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet
Input user history DRM after ID translation to Mahout IDs and splitting for
action "purchase":
B user/item  iphone  ipad  nexus-tablet  galaxy  surface
u1           1       1     0             0       0
u2           0       0     1             1       0
u3           0       0     0             0       1
u4           1       1     0             0       0
Map of IDs, Mahout to original/external:
0 -> iphone
1 -> ipad
2 -> nexus-tablet
3 -> galaxy
4 -> surface
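
Steps 1 and 2 can be sketched roughly as follows. This is plain Python, not the actual Mahout/Hadoop code; the helper names (split_by_action, assign_ids) are made up for illustration, and assigning IDs in order of first appearance is an assumption that happens to reproduce the map above.

```python
from collections import defaultdict

# The sample log from above, one "user action item" triple per line.
LOG = """\
u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet
"""


def split_by_action(log_text):
    """Group (user, item) pairs by action name (hypothetical helper)."""
    by_action = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        user, action, item = line.split()
        by_action[action].append((user, item))
    return by_action


def assign_ids(names):
    """Map each distinct name to an integer ID, in order of first appearance."""
    ids = {}
    for name in names:
        ids.setdefault(name, len(ids))
    return ids


by_action = split_by_action(LOG)
purchases = by_action["purchase"]
user_ids = assign_ids(user for user, _ in purchases)
item_ids = assign_ids(item for _, item in purchases)

# Dense user-by-item history matrix B for the "purchase" action.
B = [[0] * len(item_ids) for _ in user_ids]
for user, item in purchases:
    B[user_ids[user]][item_ids[item]] = 1

# Reverse map, kept so the external IDs can be restored on output.
id_to_item = {i: name for name, i in item_ids.items()}
```

Here B comes out as the matrix shown above and id_to_item as the Mahout-to-external map.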
To be specific the DRM from the RecommenderJob with item-item similarities
using LLR looks like this:
Input Path: out/p-recs/sims/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}
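
As a sanity check, the values above can be reproduced by hand. The sketch below is plain Python, not Mahout code. It assumes the similarity is reported as 1 - 1/(1 + LLR), with LLR computed from the 2x2 cooccurrence counts via unnormalized entropies, and that only item pairs with at least one cooccurrence are scored. Both are my understanding of what the job does, and they do match the numbers in the dump.

```python
import math
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(k11, k12, k21, k22):
    # Assumed mapping from the raw LLR score to a similarity in [0, 1).
    return 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))


# Purchase history matrix B from above (rows u1..u4, columns item IDs 0..4).
B = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
]


def item_similarities(matrix):
    n_users, n_items = len(matrix), len(matrix[0])
    sims = {i: {} for i in range(n_items)}
    for i, j in combinations(range(n_items), 2):
        k11 = sum(1 for row in matrix if row[i] and row[j])
        if k11 == 0:
            continue  # assumption: pairs that never cooccur are not scored
        k12 = sum(1 for row in matrix if row[i] and not row[j])
        k21 = sum(1 for row in matrix if not row[i] and row[j])
        k22 = n_users - k11 - k12 - k21
        sims[i][j] = sims[j][i] = llr_similarity(k11, k12, k21, k22)
    return sims


sims = item_similarities(B)
```

For iphone/ipad the counts are k11=2, k12=0, k21=0, k22=2, which gives LLR = 8 ln 2 and a similarity of about 0.8472, the same as keys 0 and 1 in the dump.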
This will be written to a directory for later Solr indexing as a csv of the
form:
item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,
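
A rough sketch of that output step, assuming the similarity rows have already been read into a dict keyed by Mahout item ID (reading the sequence file is not shown). The function name similarity_rows is hypothetical, and the cross-action column is left empty here, as in the csv above.

```python
import csv
import io

# Mahout ID -> external ID map from above.
id_to_item = {0: "iphone", 1: "ipad", 2: "nexus-tablet", 3: "galaxy", 4: "surface"}

# Item-item similarity rows, as in the sequence file dump above.
sims = {
    0: {1: 0.8472157541208549},
    1: {0: 0.8472157541208549},
    2: {3: 0.8181382096075936},
    3: {2: 0.8181382096075936},
    4: {},
}


def similarity_rows(sims, id_to_item):
    """Yield one csv row per item, similar items ordered by descending strength."""
    for item_id in sorted(sims):
        ranked = sorted(sims[item_id].items(), key=lambda kv: kv[1], reverse=True)
        similar = " ".join(id_to_item[other] for other, _ in ranked)
        yield [id_to_item[item_id], similar, ""]  # cross-action field not filled in


buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["item_id", "similar_items", "cross_action_similar_items"])
writer.writerows(similarity_rows(sims, id_to_item))
csv_text = buf.getvalue()
```

Multi-valued fields are space delimited inside a single csv column, so rank order is carried by position in the field.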
Using a user's history vector as the query, the results you get back are the
recommendations. So if the user is u1, the history vector is:
"iphone ipad"
The Solr results for query "iphone ipad" using field "similar_items" will be
1. Doc ID, ipad
2. Doc ID, iphone
If you want item similarities, for instance when a user is anonymous with no
history and is looking at the iphone product page, you would fetch the doc for
id = "iphone" and get:
"ipad"
Perhaps a bad example for ordering, since there is only one ID in the doc, but
the items in the "similar_items" field would be ordered by similarity strength.
Likewise for the cross-action similarities, though there the DRM will hold
cooccurrence [B'A] values.
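
For the sample log, the [B'A] cooccurrence matrix can be worked out directly. This is a small pure-Python sketch of the matrix product only, with B the purchase history and A the view history, both in the item order of the ID map above. It is not the actual cross-recommender code, and the function name is made up.

```python
# Rows are users u1..u4; columns are iphone, ipad, nexus-tablet, galaxy, surface.
B = [  # purchases
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
A = [  # views
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]


def transpose_times(b, a):
    """Return B'A: entry [i][j] counts users who purchased item i and viewed item j."""
    n_users = len(b)
    return [
        [sum(b[u][i] * a[u][j] for u in range(n_users)) for j in range(len(a[0]))]
        for i in range(len(b[0]))
    ]


cross = transpose_times(B, A)
# cross == [[2, 2, 2, 1, 0],
#           [2, 2, 2, 1, 0],
#           [1, 1, 1, 1, 0],
#           [1, 1, 1, 1, 0],
#           [0, 0, 0, 0, 1]]
```

Each row would then be sorted by value to fill the "cross_action_similar_items" field for that item.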
For item similarities there is no need to do more than fetch one doc that
contains the similarities, right? I've successfully used this method with the
Mahout recommender but please correct me if something above is wrong.
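
The two lookups described above can be illustrated with a toy in-memory stand-in for the index. This is not Solr and does not use Solr's scoring: search() simply ranks docs by how many query terms their field mentions, and both function names are made up for the example.

```python
# One doc per item, fields as in the csv above (cross-action field omitted).
DOCS = {
    "iphone": {"similar_items": ["ipad"]},
    "ipad": {"similar_items": ["iphone"]},
    "nexus-tablet": {"similar_items": ["galaxy"]},
    "galaxy": {"similar_items": ["nexus-tablet"]},
    "surface": {"similar_items": []},
}


def search(field, query_terms):
    """Doc IDs whose field mentions any query term, most matching terms first."""
    terms = set(query_terms)
    hits = []
    for doc_id, doc in DOCS.items():
        overlap = len(terms & set(doc[field]))
        if overlap:
            hits.append((overlap, doc_id))
    hits.sort(key=lambda hit: -hit[0])  # crude stand-in for a relevance score
    return [doc_id for _, doc_id in hits]


def fetch_similar(doc_id, field="similar_items", k=10):
    """Top-k entries of one doc's field; the field is already in rank order."""
    return DOCS[doc_id][field][:k]


# Recommendations for u1: the history vector is the query.
recs = search("similar_items", ["iphone", "ipad"])

# Similar items for an anonymous user on the iphone page: fetch a single doc.
similar = fetch_similar("iphone")
```

With this data recs holds the ipad and iphone docs (the two tie, so their relative order here is arbitrary) and similar is ["ipad"].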
On Jul 31, 2013, at 4:52 PM, Ted Dunning <[email protected]> wrote:
Pat,
See inline
On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel <[email protected]> wrote:
> So the XML as CSV would be:
> item_id,similar_items,cross_action_similar_items
> ipad,iphone,iphone nexus
> iphone,ipad,ipad galaxy
>
Right. Doesn't matter what format. Might want quotes around space
delimited lists, but anything will do.
>
> Note: As I mentioned before the order of the items in the field will
> encode rank of the similarity strength. This is for cases where you want to
> find similar items to a context item. You would fetch the doc for the
> context item by its item ID and show the top k items in the doc. Ted's
> caveat would probably be to dither them.
>
I always say "dither" so that is an easy one.
But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.
> Sounds like Ted is generating data. Andrew or M Lyon do either of you want
> to set the demo system up? If so you'll need to find a system--free tier
> AWS, Ted's box, etc. Then install all the needed stuff.
>
> I'll get the output working to csv.
>
> On Jul 31, 2013, at 11:51 AM, Pat Ferrel <[email protected]> wrote:
>
> OK and yes. The docs will look like:
>
> <add>
> <doc>
> <field name='item_id'>ipad</field>
> <field name='similar_items'>iphone</field>
> <field name='cross_action_similar_items'>iphone nexus</field>
> </doc>
> <doc>
> <field name='item_id'>iphone</field>
> <field name='similar_items'>ipad</field>
> <field name='cross_action_similar_items'>ipad galaxy</field>
> </doc>
> </add>
>
>
> On Jul 31, 2013, at 11:42 AM, B Lyon <[email protected]> wrote:
>
> I'm interested in helping as well.
> Btw I thought that what was stored in the solr fields were the llr-filtered
> items (ids I guess) for the could-be-recommended things.
>
>