I still don't understand why we need to rely on docids. If we simply index that row A is similar to rows B, C and D that should be fine, or am I wrong?
2013/8/5 Pat Ferrel <[email protected]>

> I think m/r join is the best solution, too many assumptions otherwise. I
> thought Ted wanted a non-m/r implementation, but oh, well, mostly non-m/r.
> Is there a good example to start from in Mahout?
>
> Yes, one id field per doc. The problem is not storing, it is joining rows
> from two DRMs by simple iteration.
>
> On Aug 5, 2013, at 10:27 AM, Sebastian Schelter <[email protected]> wrote:
>
> If you use the same partitioning and number of reducers for creating the
> outputs, the output should have the same number of sequence files and each
> sequence file should have the same keys in descending order. I don't
> understand why the ordering is a problem, can we not store the row index as
> a field in solr?
>
> 2013/8/5 Ted Dunning <[email protected]>
>
> > A quick map-reduce program should be able to join these matrices and
> > produce documents ready to index.
> >
> >
> > On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel <[email protected]> wrote:
> >
> >> In writing the similarity matrices to Solr there is a bit of a problem.
> >> The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
> >> far as I know there is no guarantee that the ids of both matrices are
> >> in the same descending order.
> >>
> >> The easiest solution is to have an index for [B'B] and one for [B'A].
> >> That means two or perhaps three queries for cross-recommendations,
> >> which is not ideal.
> >>
> >> First I'm going to create two collections of docs with different field
> >> ids--this should work and we can merge them later.
> >>
> >> Next we can do some m/r to group the docs by id so there is one
> >> collection (csv) with one line per doc.
> >>
> >> Alternatively it is possible that the DRMs can be iterated
> >> simultaneously, which would also solve the problem. It assumes the
> >> order in both DRMs is the same, descending by Key = item ID. Even if a
> >> row is missing in one or the other this would work.
> >>
> >> Does anyone know if the DRMs are guaranteed to have row ordering by Key?
> >> RSJ creates [B'B] and matrix multiply creates [B'A]
> >>
> >>
> >> On Aug 2, 2013, at 11:14 PM, Ted Dunning <[email protected]> wrote:
> >>
> >> Yes. We need two different sets of documents if the row space of the
> >> cross/co-occurrence matrices are different as is the case with A'B and
> >> B'B.
> >>
> >> This could mean two indexes.
> >>
> >> Or a single index with a special field to indicate what type of record
> >> you have.
> >>
> >>
> >> On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel <[email protected]> wrote:
> >>
> >>> Thanks, well put.
> >>>
> >>> In order to have the ultimate impl with two id spaces for A and B
> >>> would we have to create different docs for A'B and B'B? Since the docs
> >>> IDs must come from A or B? The fields can contain different sets of
> >>> IDs but the Doc ID must be one or the other, right? Doesn't this imply
> >>> separate indexes for the separate A, B item IDs spaces? This is not a
> >>> question for this first cut impl but is a generalization question.
> >>>
> >>> On Aug 2, 2013, at 2:06 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>> So there is a lot of good discussion here and there were some key
> >>> ideas.
> >>>
> >>> The first idea is that the *input* to a recommender is on the right in
> >>> the matrix notation. This refers inherently to the id's on the columns
> >>> of the recommender product (either B'B or B'A). The columns are
> >>> defined by the right hand element of the product (either B or A in the
> >>> B'B and B'A respectively).
> >>>
> >>> The results are in the row space and are defined by the left hand
> >>> operand of the product. In the case of B'A and B'B, the left hand
> >>> operand is B in both cases so the row space is consistent.
> >>>
> >>> In order to implement this in a search engine, we need documents that
> >>> correspond to rows of B'A or B'B. These are the same as the columns of
> >>> B. The fields of the documents will necessarily include the following:
> >>>
> >>> id: the column id from B corresponding to this item
> >>> description: presentation info ... yada yada
> >>> b-a-links: contents of this row of B'A expressed as id's from the
> >>>            column space of A where this row of llr-filter(B'A)
> >>>            contains a non-zero value.
> >>> b-b-links: contents of this row of B'B expressed as id's from the
> >>>            column space of B ...
> >>>
> >>>
> >>> The following operations are now single queries:
> >>>
> >>> get an item where id = x
> >>>      query is [id:x]
> >>>
> >>> recommend based on behavior with regard to A items and actions h_a
> >>>      query is [b-a-links: h_a]
> >>>
> >>> recommend based on behavior with regard to B items and actions h_b
> >>>      query is [b-b-links: h_b]
> >>>
> >>> recommend based on a single item with id = x
> >>>      query is [b-b-links: x]
> >>>
> >>> recommend based on composite behavior composed of h_a and h_b
> >>>      query is [b-a-links: h_a b-b-links: h_b]
> >>>
> >>> Does this make sense by being more explicit?
> >>>
> >>> Now, it is pretty clear that we could have an index of A objects as
> >>> well but the link fields would have to be a-a-links and a-b-links, of
> >>> course.
> >>>
> >>>
> >>> On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel <[email protected]> wrote:
> >>>
> >>>> Assuming Ted needs to call it, not sure if an invite has gone out, I
> >>>> haven't seen one.
> >>>>
> >>>> On Aug 2, 2013, at 12:49 PM, B Lyon <[email protected]> wrote:
> >>>>
> >>>> I am planning on sitting in as flaky connection allows.
> >>>> On Aug 2, 2013 3:21 PM, "Pat Ferrel" <[email protected]> wrote:
> >>>>
> >>>>> We doing a hangout at 2 on the Solr recommender?
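
To make the join discussed in the quoted thread concrete, here is a minimal sketch in plain Python. It is not Mahout or Solr code. It assumes each DRM has already been read into (row id, column ids) pairs, and it groups rows by id rather than relying on both matrices sharing the same key order. The function names `join_rows_by_id` and `links_query` are hypothetical, and the `field:(term term)` query form is an assumption about how the bracketed queries above might be written for Solr.

```python
from collections import defaultdict


def join_rows_by_id(named_matrices):
    """Merge rows from several matrices into one document per row id.

    named_matrices maps a field name (e.g. "b-b-links") to an iterable of
    (row_id, column_ids) pairs. Rows may arrive in any order and a row id
    may be missing from some matrices; no ordering is assumed.
    """
    docs = defaultdict(dict)
    for field, rows in named_matrices.items():
        for row_id, column_ids in rows:
            docs[row_id]["id"] = row_id
            # Store the non-zero column ids as a space-separated field value.
            docs[row_id][field] = " ".join(str(c) for c in column_ids)
    return dict(docs)


def links_query(history_by_field):
    """Build one query string from per-field behavior histories.

    The field:(term term) form is an assumed rendering of the thread's
    [b-a-links: h_a b-b-links: h_b] notation.
    """
    clauses = []
    for field, item_ids in history_by_field.items():
        terms = " ".join(str(i) for i in item_ids)
        clauses.append(f"{field}:({terms})")
    return " ".join(clauses)


# Toy data: the two matrices list their rows in different orders, and
# each one is missing a row that the other has.
b_b_rows = [("item3", ["item1"]), ("item1", ["item2", "item3"])]
b_a_rows = [("item1", ["a7"]), ("item2", ["a7", "a9"])]

docs = join_rows_by_id({"b-b-links": b_b_rows, "b-a-links": b_a_rows})
query = links_query({"b-a-links": ["a7"], "b-b-links": ["item1", "item2"]})
```

Grouping by key is what a reduce-side join would do at scale; the in-memory dictionary here only stands in for that step on small data.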
