I still don't understand why we need to rely on docids. If we simply index
that row A is similar to rows B, C and D that should be fine, or am I wrong?
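
To make the question concrete, here is a minimal sketch of the join being discussed. It groups rows by their key instead of relying on row order, so it works even if the two matrices are sorted differently or one is missing a row. The function name and the toy data are made up for illustration; the field names b-b-links and b-a-links are taken from Ted's mail below. This is plain in-memory Python under those assumptions, not Mahout or Solr code.

```python
from collections import defaultdict


def join_rows_by_key(named_matrices):
    """Merge sparse rows from several matrices into one document per row key.

    named_matrices maps a field name to an iterable of (row_key, column_ids)
    pairs. The iterables may be in any order and may lack some keys.
    """
    docs = defaultdict(dict)
    for field, rows in named_matrices.items():
        for key, column_ids in rows:
            docs[key]["id"] = key
            # Store the non-zero column ids as a space-delimited field value.
            docs[key][field] = " ".join(str(c) for c in column_ids)
    return dict(docs)


# Hypothetical toy rows: (row key, ids of the non-zero columns).
bb_rows = [(3, [1, 2]), (1, [3]), (2, [3])]   # stands in for [B'B]
ba_rows = [(1, [10, 11]), (3, [12])]          # stands in for [B'A], row 2 missing

docs = join_rows_by_key({"b-b-links": bb_rows, "b-a-links": ba_rows})
for key in sorted(docs):
    print(docs[key])
```

A map-reduce join does the same thing at scale: the mapper emits (row key, tagged row) and the reducer assembles one document per key.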

2013/8/5 Pat Ferrel <[email protected]>

> I think m/r join is the best solution, too many assumptions otherwise. I
> thought Ted wanted a non-m/r implementation, but oh, well, mostly non-m/r.
> Is there a good example to start from in Mahout?
>
> Yes, one id field per doc. The problem is not storing, it is joining rows
> from two DRMs by simple iteration.
>
> On Aug 5, 2013, at 10:27 AM, Sebastian Schelter <[email protected]> wrote:
>
> If you use the same partitioning and number of reducers for creating the
> outputs, the output should have the same number of sequence files and each
> sequence file should have the same keys in descending order. I don't
> understand why the ordering is a problem, can we not store the row index as
> a field in solr?
>
> 2013/8/5 Ted Dunning <[email protected]>
>
> > A quick map-reduce program should be able to join these matrices and
> > produce documents ready to index.
> >
> >
> > On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel <[email protected]> wrote:
> >
> >> In writing the similarity matrices to Solr there is a bit of a problem.
> >> The matrices exist in two DRMs. The rows correspond to the doc IDs. As far
> >> as I know there is no guarantee that the ids of both matrices are in the
> >> same descending order.
> >>
> >> The easiest solution is to have an index for [B'B] and one for [B'A]. That
> >> means two or perhaps three queries for cross-recommendations, which is not
> >> ideal.
> >>
> >> First I'm going to create two collections of docs with different field
> >> ids--this should work and we can merge them later.
> >>
> >> Next we can do some m/r to group the docs by id so there is one collection
> >> (csv) with one line per doc.
> >>
> >> Alternatively, it is possible that the DRMs can be iterated
> >> simultaneously, which would also solve the problem. It assumes the order in
> >> both DRMs is the same, descending by Key = item ID. Even if a row is
> >> missing in one or the other this would work.
> >>
> >> Does anyone know if the DRMs are guaranteed to have row ordering by Key?
> >> RSJ creates [B'B] and matrix multiply creates [B'A]
> >>
> >>
> >> On Aug 2, 2013, at 11:14 PM, Ted Dunning <[email protected]> wrote:
> >>
> >> Yes.  We need two different sets of documents if the row spaces of the
> >> cross/co-occurrence matrices are different as is the case with A'B and B'B.
> >>
> >> This could mean two indexes.
> >>
> >> Or a single index with a special field to indicate what type of record you
> >> have.
> >>
> >>
> >> On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel <[email protected]> wrote:
> >>
> >>> Thanks, well put.
> >>>
> >>> In order to have the ultimate impl with two id spaces for A and B would we
> >>> have to create different docs for A'B and B'B? Since the doc IDs must come
> >>> from A or B? The fields can contain different sets of IDs but the Doc ID
> >>> must be one or the other, right? Doesn't this imply separate indexes for
> >>> the separate A, B item ID spaces? This is not a question for this first
> >>> cut impl but is a generalization question.
> >>>
> >>> On Aug 2, 2013, at 2:06 PM, Ted Dunning <[email protected]> wrote:
> >>>
> >>> So there is a lot of good discussion here and there were some key ideas.
> >>>
> >>> The first idea is that the *input* to a recommender is on the right in the
> >>> matrix notation.  This refers inherently to the id's on the columns of the
> >>> recommender product (either B'B or B'A).  The columns are defined by the
> >>> right hand element of the product (either B or A in B'B and B'A
> >>> respectively).
> >>>
> >>> The results are in the row space and are defined by the left hand operand
> >>> of the product.  In the case of B'A and B'B, the left hand operand is B in
> >>> both cases so the row space is consistent.
> >>>
> >>> In order to implement this in a search engine, we need documents that
> >>> correspond to rows of B'A or B'B.  These are the same as the columns of B.
> >>> The fields of the documents will necessarily include the following:
> >>>
> >>> id: the column id from B corresponding to this item
> >>> description: presentation info ... yada yada
> >>> b-a-links: contents of this row of B'A expressed as id's from the column
> >>> space of A where this row of llr-filter(B'A) contains a non-zero value.
> >>> b-b-links: contents of this row of B'B expressed as id's from the column
> >>> space of B ...
> >>>
> >>>
> >>> The following operations are now single queries:
> >>>
> >>> get an item where id = x
> >>>     query is [id:x]
> >>>
> >>> recommend based on behavior with regard to A items and actions h_a
> >>>     query is [b-a-links: h_a]
> >>>
> >>> recommend based on behavior with regard to B items and actions h_b
> >>>     query is [b-b-links: h_b]
> >>>
> >>> recommend based on a single item with id = x
> >>>      query is [b-b-links: x]
> >>>
> >>> recommend based on composite behavior composed of h_a and h_b
> >>>      query is [b-a-links: h_a b-b-links: h_b]
> >>>
> >>> Does this make sense by being more explicit?
> >>>
> >>> Now, it is pretty clear that we could have an index of A objects as well
> >>> but the link fields would have to be a-a-links and a-b-links, of course.
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel <[email protected]> wrote:
> >>>
> >>>> Assuming Ted needs to call it, not sure if an invite has gone out, I
> >>>> haven't seen one.
> >>>>
> >>>> On Aug 2, 2013, at 12:49 PM, B Lyon <[email protected]> wrote:
> >>>>
> >>>> I am planning on sitting in as flaky connection allows.
> >>>> On Aug 2, 2013 3:21 PM, "Pat Ferrel" <[email protected]> wrote:
> >>>>
> >>>>> Are we doing a hangout at 2 on the Solr recommender?
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
>
>
