[jira] Issue Comment Edited: (SOLR-651) A SearchComponent for fetching TF-IDF values

Yonik Seeley (JIRA) Tue, 28 Oct 2008 10:24:06 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643270#action_12643270
 ]


[EMAIL PROTECTED] edited comment on SOLR-651 at 10/28/08 10:23 AM:
--------------------------------------------------------------

Some random thoughts on this patch:
 - Adding the uniqueKeyFieldName seems out of place... it's just one element 
of the schema and it doesn't seem like it belongs in this component.
 - How about using the "id" as the key, as is done in other places like 
highlighting?
  So instead of 
{code}
  <lst name="doc-170">
    <str name="uniqueKey">3007WFP</str>
    <lst name="cat">
      <lst name="electronics"/>
      <lst name="monitor"/>
    </lst>
  </lst>
{code}
it could look like
{code}
  <lst name="3007WFP">
    <lst name="cat">
      <lst name="electronics"/>
      <lst name="monitor"/>
    </lst>
  </lst>
{code}
- It doesn't seem like we should link the ability to return term vectors with 
term vectors being stored.  Like highlighting, they should be used when 
available for speed, but stored fields should also be possible.  It's fine for 
the impl of that to wait, but perhaps the interface should support that via a 
tv.fl parameter.  Update: I just looked at the code again, and I see there is a 
tv.fl param, so I guess the only discussion point is whether the default is 
right (all fields with term vectors stored).
- "idf" actually isn't the idf, it's the doc freq that is being returned.  The 
label should probably be changed to "df".
- Instead of "freq", how about just using the shorter and well-known "tf"?
- The docs say that tf_idf "Calculates tf*idf for each term.", but the code is 
actually returning "freq"/"idf" (but the "idf" is actually a df, so this is 
tf/df, i.e. a straight tf * idf with idf = 1/df).  *But* this doesn't seem that 
useful because the user could trivially do tf/df themselves.  What would seem 
useful is to get the actual scoring tf-idf (via the Similarity).  For better 
language mappings, I think we should avoid dashes in parameter names too... 
perhaps tv.tfidf or tv.tf_idf?
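
To make the last point concrete, here is a rough sketch (not code from the patch) contrasting the tf/df ratio a client could compute on its own with a scoring-style weight.  The class and method names are made up for illustration, and the formulas are the ones I recall from Lucene's DefaultSimilarity (tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1); a real implementation would ask the configured Similarity rather than hard-code them, and this is not the full Lucene score (no norms or boosts).

```java
// Hypothetical helper, for illustration only; not part of the SOLR-651 patch.
final class TfIdfSketch {

    // The ratio a client could trivially compute from the returned "tf" and "df".
    static double rawRatio(int tf, int df) {
        if (df <= 0) {
            throw new IllegalArgumentException("df must be positive");
        }
        return (double) tf / df;
    }

    // Assumed DefaultSimilarity-style term frequency factor: sqrt(freq).
    static double classicTf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed DefaultSimilarity-style idf: ln(numDocs / (docFreq + 1)) + 1.
    static double classicIdf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Scoring-style weight: tf factor times idf factor.
    static double classicTfIdf(int freq, int docFreq, int numDocs) {
        return classicTf(freq) * classicIdf(docFreq, numDocs);
    }
}
```

For example, a term with freq 4 and doc freq 9 in a 10-document index has a raw ratio of 4/9, while the scoring-style weight is sqrt(4) * (ln(10/10) + 1) = 2.  The two numbers are on different scales, which is why returning the Similarity-based value adds something the client cannot easily reproduce.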



  
> A SearchComponent for fetching TF-IDF values
> --------------------------------------------
>
>                 Key: SOLR-651
>                 URL: https://issues.apache.org/jira/browse/SOLR-651
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.3
>            Reporter: Noble Paul
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-651-fixes.patch, SOLR-651.patch, SOLR-651.patch, 
> SOLR-651.patch, SOLR-651.patch, SOLR-651.patch, SOLR-651.patch
>
>
> A SearchComponent that can return the TF-IDF vector for any given document in 
> the SOLR index
> Query : A Document Number / a query identifying a Document
> Response :  A Map of term vs. TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on TFIDF representation of
> documents, hence adding a Request Handler providing the TFIDF representation
> will pave the way for incorporating Learning Paradigms to SOLR framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
