Re: Setting up a recommender

Michael Sokolov Mon, 22 Jul 2013 19:05:49 -0700

Fair enough - thanks for clarifying. I wondered whether that would be worth the trouble, also. Maybe one of the academics Pat mentioned will test and find out for us :)


On 7/22/13 6:45 PM, Ted Dunning wrote:

Not entirely without regard to weight. Just without regard to designing weights specific to this application. The weights that Solr uses natively are intuitively what we want (rare indicators have higher weights in a log-ish kind of way).
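To illustrate the "log-ish" behaviour, here is a minimal sketch of the inverse document frequency term, assuming the classic TF-IDF similarity that was Lucene's default at the time (idf = 1 + ln(numDocs / (docFreq + 1))). The function name and the catalog size are hypothetical, chosen only for illustration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the classic Lucene TF-IDF form: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    num_docs = 1_000_000  # hypothetical number of indexed items
    # A rarer indicator (smaller doc_freq) gets a larger weight,
    # but the weight grows only logarithmically with rarity.
    for doc_freq in (9, 999, 99_999):
        print(doc_freq, classic_idf(doc_freq, num_docs))
```

Each 100-fold drop in document frequency adds roughly the same constant to the weight, which is the sense in which rare indicators are boosted without dominating.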

Frankly, I doubt the effectiveness here of mathematical reasoning for getting a better weighting. The deviations from optimal relative to the Solr defaults are probably as large as the deviations from the assumptions that the mathematically motivated weightings are based on. Fixing this is spending a lot for small potatoes. Fixing the data flow and getting access to more data is far higher value.

On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov wrote:


    So you are proposing just grabbing the top N scoring related items
    and indexing them without regard to weight?  Effectively
    quantizing the weights to 1 for those, and 0 for everything else?
    I guess LLR tends to do that anyway.

    -Mike
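A minimal sketch of the quantization described above, under the assumption that each item's related items arrive as a mapping from item ID to an LLR-style score. The function name, the IDs, and the scores are hypothetical; the output is the text one would index into a single field, leaving term weighting to Solr.

```python
import heapq
from typing import Dict


def top_n_indicators(scores: Dict[str, float], n: int) -> str:
    """Keep the n highest-scoring related items and drop their scores.

    Every kept item appears exactly once (weight 1); every other item
    is absent (weight 0).
    """
    top = heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
    return " ".join(item_id for item_id, _ in top)


if __name__ == "__main__":
    related = {"a": 12.0, "b": 0.5, "c": 7.3, "d": 3.1}  # made-up scores
    print(top_n_indicators(related, 2))
```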


    On 07/22/2013 02:57 PM, Ted Dunning wrote:

        My experience is that TFIDF works just fine, especially as a
        first cut.

        Adding different kinds of data, building out backend A/B
        testing, tuning the UI, and weighting the query all come
        before the next round of weighting changes.  Typically, the
        priority stack never empties enough for that task to rise
        to the top.


        On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov wrote:

            On 07/22/2013 12:20 PM, Pat Ferrel wrote:

                My understanding of the Solr proposal puts B's row
                similarity matrix in a
                vector per item. That means each row is turned into
                "terms" = external
                IDs--not sure how the weights of each term are encoded.

            This is the key question for me. The best idea I've had is
            to use termFreq as a proxy for weight.  It's only an
            integer, so there are scaling issues to consider, but you
            can apply a per-field weight to manage that.  Also, Lucene
            (and Solr) doesn't provide an obvious way to load term
            frequencies directly: probably the simplest thing to do is
            just to repeat the cross-term N times and let the text
            analysis take care of counting them.  Inefficient, but
            probably the quickest way to get going.  Alternatively,
            there are some lower level Lucene indexing APIs
            (DocFieldConsumer et al) which I haven't really plumbed
            entirely, but would allow for more direct loading of
            fields.

            Then one probably wants to override the scoring in some
            way (unless TFIDF
            is the way to go somehow??)
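The repeat-the-term idea above could be sketched as follows. This is only an illustration under stated assumptions: weights are positive floats, the field is analyzed by a whitespace-style tokenizer so repeated IDs are counted as term frequency, and `max_repeats` is a hypothetical scaling knob, not anything Lucene or Solr defines.

```python
from typing import Dict


def weights_to_field_text(weights: Dict[str, float], max_repeats: int = 10) -> str:
    """Encode weights as term frequencies by repeating each item ID.

    The largest weight maps to max_repeats copies; smaller weights are
    scaled proportionally and rounded, with at least one copy each.
    """
    if not weights:
        return ""
    top = max(weights.values())
    if top <= 0:
        raise ValueError("weights must be positive")
    tokens = []
    for item_id, weight in weights.items():
        repeats = max(1, round(max_repeats * weight / top))
        tokens.extend([item_id] * repeats)
    return " ".join(tokens)


if __name__ == "__main__":
    # Hypothetical similarity weights for one item's row.
    row = {"item42": 8.0, "item7": 2.0, "item99": 0.1}
    print(weights_to_field_text(row, max_repeats=4))
```

The rounding step is where the integer scaling issue shows up: any weight below roughly top / (2 * max_repeats) collapses to a single occurrence.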
