I think m/r join is the best solution, too many assumptions otherwise. I thought Ted wanted a non-m/r implementation, but oh, well, mostly non-m/r. Is there a good example to start from in Mahout?
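For reference, a minimal sketch of the join logic in plain Python rather than Mahout or Hadoop code. It assumes each DRM can be read as (row key, list of column ids) pairs; `join_rows` and the sample data are hypothetical names made up for illustration, and only the field names `id`, `b-b-links` and `b-a-links` come from the thread. Row order does not matter, and a row missing from one matrix just leaves that field empty.

```python
from collections import defaultdict


def join_rows(bb_rows, ba_rows):
    """Group rows of two row-keyed matrices by item id (reduce-side join style).

    bb_rows, ba_rows: iterables of (row_key, [column ids]) in any order.
    Returns one dict per row key, sorted by key.
    """
    # "Map" step: tag every row with the document field it will fill.
    tagged = [(key, "b-b-links", cols) for key, cols in bb_rows]
    tagged += [(key, "b-a-links", cols) for key, cols in ba_rows]

    # "Shuffle" step: bring all rows with the same key together.
    grouped = defaultdict(dict)
    for key, field, cols in tagged:
        grouped[key][field] = cols

    # "Reduce" step: emit one document per key; a missing row gives an empty field.
    docs = []
    for key in sorted(grouped):
        fields = grouped[key]
        docs.append({
            "id": key,
            "b-b-links": " ".join(str(c) for c in fields.get("b-b-links", [])),
            "b-a-links": " ".join(str(c) for c in fields.get("b-a-links", [])),
        })
    return docs


if __name__ == "__main__":
    # Hypothetical rows: the two inputs are deliberately in different orders.
    bb = [(3, [1, 2]), (1, [3])]
    ba = [(1, [10, 11]), (2, [12])]
    for doc in join_rows(bb, ba):
        print(doc)
```

In a real m/r job the tagging would happen in the mappers and the grouping would be done by the framework's shuffle, so only the last step would live in the reducer.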
Yes, one id field per doc. The problem is not storing, it is joining rows from two DRMs by simple iteration.

On Aug 5, 2013, at 10:27 AM, Sebastian Schelter <[email protected]> wrote:

If you use the same partitioning and number of reducers for creating the outputs, the output should have the same number of sequence files and each sequence file should have the same keys in descending order.

I don't understand why the ordering is a problem, can we not store the row index as a field in solr?

2013/8/5 Ted Dunning <[email protected]>

> A quick map-reduce program should be able to join these matrices and
> produce documents ready to index.
>
>
> On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel <[email protected]> wrote:
>
>> In writing the similarity matrices to Solr there is a bit of a problem.
>> The Matrices exist in two DRMs. The rows correspond to the doc IDs. As far
>> as I know there is no guarantee that the ids of both matrices are in the
>> same descending order.
>>
>> The easiest solution is to have an index for [B'B] and one for [B'A]. That
>> means two or perhaps three queries for cross-recommendations, which is not
>> ideal.
>>
>> First I'm going to create two collections of docs with different field
>> ids--this should work and we can merge them later.
>>
>> Next we can do some m/r to group the docs by id so there is one collection
>> (csv) with one line per doc.
>>
>> Alternatively it is possible that the DRMs can be iterated
>> simultaneously, which would also solve the problem. It assumes the order in
>> both DRMs is the same, descending by Key = item ID. Even if a row is
>> missing in one or the other this would work.
>>
>> Does anyone know if the DRMs are guaranteed to have row ordering by Key?
>> RSJ creates [B'B] and matrix multiply creates [B'A]
>>
>>
>> On Aug 2, 2013, at 11:14 PM, Ted Dunning <[email protected]> wrote:
>>
>> Yes. We need two different sets of documents if the row space of the
>> cross/co-occurrence matrices are different as is the case with A'B and B'B.
>>
>> This could mean two indexes.
>>
>> Or a single index with a special field to indicate what type of record you
>> have.
>>
>>
>> On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel <[email protected]> wrote:
>>
>>> Thanks, well put.
>>>
>>> In order to have the ultimate impl with two id spaces for A and B would we
>>> have to create different docs for A'B and B'B? Since the docs IDs must come
>>> from A or B? The fields can contain different sets of IDs but the Doc ID
>>> must be one or the other, right? Doesn't this imply separate indexes for
>>> the separate A, B item IDs spaces? This is not a question for this first
>>> cut impl but is a generalization question.
>>>
>>> On Aug 2, 2013, at 2:06 PM, Ted Dunning <[email protected]> wrote:
>>>
>>> So there is a lot of good discussion here and there were some key ideas.
>>>
>>> The first idea is that the *input* to a recommender is on the right in the
>>> matrix notation. This refers inherently to the id's on the columns of the
>>> recommender product (either B'B or B'A). The columns are defined by the
>>> right hand element of the product (either B or A in the B'B and B'A
>>> respectively).
>>>
>>> The results are in the row space and are defined by the left hand operand
>>> of the product. In the case of B'A and B'B, the left hand operand is B in
>>> both cases so the row space is consistent.
>>>
>>> In order to implement this in a search engine, we need documents that
>>> correspond to rows of B'A or B'B. These are the same as the columns of B.
>>> The fields of the documents will necessarily include the following:
>>>
>>> id: the column id from B corresponding to this item
>>> description: presentation info ... yada yada
>>> b-a-links: contents of this row of B'A expressed as id's from the column
>>>            space of A where this row of llr-filter(B'A) contains a
>>>            non-zero value.
>>> b-b-links: contents of this row of B'B expressed as id's from the column
>>>            space of B ...
>>>
>>>
>>> The following operations are now single queries:
>>>
>>> get an item where id = x
>>>     query is [id:x]
>>>
>>> recommend based on behavior with regard to A items and actions h_a
>>>     query is [b-a-links: h_a]
>>>
>>> recommend based on behavior with regard to B items and actions h_b
>>>     query is [b-b-links: h_b]
>>>
>>> recommend based on a single item with id = x
>>>     query is [b-b-links: x]
>>>
>>> recommend based on composite behavior composed of h_a and h_b
>>>     query is [b-a-links: h_a b-b-links: h_b]
>>>
>>> Does this make sense by being more explicit?
>>>
>>> Now, it is pretty clear that we could have an index of A objects as well
>>> but the link fields would have to be a-a-links and a-b-links, of course.
>>>
>>>
>>> On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> Assuming Ted needs to call it, not sure if an invite has gone out, I
>>>> haven't seen one.
>>>>
>>>> On Aug 2, 2013, at 12:49 PM, B Lyon <[email protected]> wrote:
>>>>
>>>> I am planning on sitting in as flaky connection allows.
>>>> On Aug 2, 2013 3:21 PM, "Pat Ferrel" <[email protected]> wrote:
>>>>
>>>>> We doing a hangout at 2 on the Solr recommender?
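To make the single-query idea concrete, here is a rough sketch in Python of how the query strings listed in Ted's message might be assembled. It is an assumption-laden illustration, not Solr client code: `item_query` and `links_query` are hypothetical helper names, the `field:(term term ...)` grouping is standard Lucene-style query syntax, and escaping, boosting and whether hyphenated field names are acceptable in a given schema are all left out.

```python
def item_query(x):
    """Query for a single item by id, i.e. [id:x] in the thread's notation."""
    return f"id:{x}"


def links_query(h_a=(), h_b=()):
    """Build one query string from behavior histories.

    h_a: ids of A items the user acted on (matched against b-a-links).
    h_b: ids of B items the user acted on (matched against b-b-links).
    Passing both gives the composite query; passing a single id in h_b
    gives the "items similar to x" query.
    """
    clauses = []
    if h_a:
        clauses.append("b-a-links:(" + " ".join(str(i) for i in h_a) + ")")
    if h_b:
        clauses.append("b-b-links:(" + " ".join(str(i) for i in h_b) + ")")
    # Clauses are simply space-separated, as in [b-a-links: h_a b-b-links: h_b].
    return " ".join(clauses)


if __name__ == "__main__":
    print(item_query(42))
    print(links_query(h_a=[7, 9]))
    print(links_query(h_b=[42]))
    print(links_query(h_a=[7, 9], h_b=[3, 5]))
```

The point of the sketch is only that every case in the list above reduces to one query string against one index of B items.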
