I am wondering about row/column confusion as well - fleshing out the
doc/design with more specifics (which Pat is kind of doing, basically)
should make things obvious eventually, imo.
The way Pat had phrased it got me to wondering what rationale you use to
rank the results when you are querying the columns ("similar column",
"similar via action 2 column", etc.).
He had mentioned the auxiliary case of simply getting the most similar items
to a given docid by going to the row for that docid and using the pre-sorted
values in the "similar column". I thought Ted might have hinted that you
could just as well do a Solr query of the column with that single docid as
the query. In the latter case, though, I wonder if the order and the list
itself could be weird, as some items may show up simply because they are not
similar to many things: a low LLR value that got filtered out of the list
for the docid itself won't get filtered out of the list for one of those
"not similar to very many items" things when its Solr field is generated.
I guess using an absolute cutoff for LLR in the filtering could deal with
some of this issue. All hypothetical at the moment (for me, anyway), as
"real" data might trivially dismiss some of these concerns as irrelevant.
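To make the concern concrete, here is a minimal Python sketch. Everything in
it is an assumption for illustration: the LLR values are made up, plain dicts
stand in for Solr docs, and build_fields, row_lookup and column_query are
hypothetical helpers, not Mahout or Solr calls. It also ignores how Solr would
actually rank the hits and only compares which items come back.

```python
# Made-up, symmetric LLR scores between items (illustration only).
RAW_LLR = {
    ("A", "B"): 40.0,
    ("A", "C"): 35.0,
    ("A", "D"): 30.0,
    ("A", "Z"): 2.0,   # weak link; Z is not similar to anything else
    ("B", "C"): 25.0,
}
ITEMS = sorted({item for pair in RAW_LLR for item in pair})


def neighbors(item):
    """Return all (other_item, llr) pairs for item, strongest first."""
    out = []
    for (a, b), llr in RAW_LLR.items():
        if a == item:
            out.append((b, llr))
        elif b == item:
            out.append((a, llr))
    return sorted(out, key=lambda pair: -pair[1])


def build_fields(top_k=None, min_llr=None):
    """Build each doc's 'similar' field: item id -> list of similar item ids."""
    fields = {}
    for item in ITEMS:
        kept = neighbors(item)
        if min_llr is not None:          # absolute cutoff
            kept = [(o, s) for o, s in kept if s >= min_llr]
        if top_k is not None:            # per-item top-k filtering
            kept = kept[:top_k]
        fields[item] = [o for o, _ in kept]
    return fields


def row_lookup(fields, item):
    """Fetch the doc for item and read its pre-sorted 'similar' field."""
    return fields[item]


def column_query(fields, item):
    """Find every doc whose 'similar' field mentions item."""
    return sorted(doc for doc, similar in fields.items() if item in similar)


top2 = build_fields(top_k=2)
print(row_lookup(top2, "A"))     # A's own list after top-2 filtering
print(column_query(top2, "A"))   # docs that list A, including weak ones

cutoff = build_fields(min_llr=10.0)
print(row_lookup(cutoff, "A"), column_query(cutoff, "A"))
```

With per-item top-k filtering, Z keeps A in its field because A is the only
thing Z is similar to, so the column query for A returns Z even though A's own
list dropped it. With an absolute cutoff on a symmetric score the two lookups
return the same set of items in this toy data.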
I think the hangout is a good idea, too, btw, and hope to be able to sit in
if it happens. Very excited about this approach.
On Thu, Aug 1, 2013 at 6:03 PM, Ted Dunning <[email protected]> wrote:
> On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel <[email protected]> wrote:
>
> > Sorry to be dense but I think there is some miscommunication. The most
> > important question is: am I writing the item-item similarity matrix DRM
> > out to Solr, one row = one Solr doc?
>
>
> Each row = one *field* in a Solr doc. Different DRM's produce different
> fields in the same docs.
>
> There will also be item meta-data in the field.
>
>
> > For the mapreduce Mahout Item-based recommender this is in
> > "tmp/similarityMatrix". If not then please stop me. If I'm off base here,
> > maybe a skype or im session will straighten me out.
> [email protected]
> > [email protected]
>
>
> Actually, that is a grand idea. Let's do a hangout.
>
> From the who-is-free-when survey
> <https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform>,
> it looks like lots of people are available tomorrow at 2PM PDT.
>
> Would that work?
>
> > To be clear below I'm not talking about history based recs, which is the
> > primary use case. I am talking about a query that does not use history,
> > that only finds similar items based on training data. The item-item
> > similarity matrix DRM contains Key = item ID, Value = list of item IDs
> > with similarity strengths.
> >
>
> Yes. I absolutely agree that you can do this.
>
> These should, strictly speaking, be columns in the item-item matrix. The
> item-item matrix may or may not be symmetric. If it is symmetric, then
> column or row doesn't matter.
>
>
> > This is equivalent to the list returned by ItemBasedRecommender's
> > public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
> > throws TasteException
> >
>
> Yes.
>
>
> > Specified by:
> > mostSimilarItems in interface ItemBasedRecommender
> >
> > Parameters:
> > itemID - ID of item for which to find most similar other items
> > howMany - desired number of most similar items to find
> >
> > Returns:
> > items most similar to the given item, ordered from most similar to least
> >
> > To get the list from Solr you would fetch the doc associated with
> > "itemID", no?
> >
>
> If you store the column, then yes.
>
> If you store the row, then using a query on the field containing the
> similar items is the right answer.
>
> The key difference that I have is what happens in the next step.
>
> > When using the Mahout mapreduce item-based recommender we get the
> > similarity matrix and do just that. We get the row associated with the
> > Mahout itemID and recommend the top k items from the vector. This
> > performs well in cross-validation tests.
> >
>
> Good.
>
> I think that there is a row/column confusion here, but they are probably
> nearly identical in your application.
>
> The key point is what happens *after* you do the query that you are
> suggesting.
>
> In your case, you have to retrieve the meta-data associated with each of
> the related items. I like to store this meta-data in a Solr field (or
> three), so this involves at least one additional query. You can
> automatically chain this second query by using the "join" operation that
> Solr provides, but the second query still happens.
>
> If you do the query the way that I suggest, this second query doesn't need
> to happen. You get the meta-data directly.
>
>
>
>
>
> >
> >
> >
> > On Aug 1, 2013, at 9:49 AM, Ted Dunning <[email protected]> wrote:
> >
> > On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel <[email protected]> wrote:
> >
> > >
> > > For item similarities there is no need to do more than fetch one doc
> > > that contains the similarities, right? I've successfully used this
> > > method with the Mahout recommender but please correct me if something
> > > above is wrong.
> >
> >
> > No.
> >
> > First, you need to retrieve all the other documents that are referenced
> > to get their display meta-data. So this isn't just a one document fetch.
> >
> > Second, the similar items point inwards, not outwards. Thus, the query
> > you want has the id of the current item and searches the similar_items
> > field. The result of that search is all of the similar items.
> >
> > The confusion here may stem from the name of the field. A name like
> > "linked-from-items" or some such might help here.
> >
> >
> > Another way to look at this is that there should be no procedural
> > difference if you have 10 items or 20 in your history. Either way, your
> > history is a query against the appropriate link fields. Likewise, there
> > should be no difference between having 10 items or 2 items in your
> > history. There shouldn't even be any difference if you have even just
> > 1 item in your history.
> >
> > Finding items similar to a single item is exactly like having 1 item in
> > your history. So that should be done by searching with that one item in
> > the appropriate link fields.
> >
> >
>
--
BF Lyon
http://www.nowherenearithaca.com