This sounds like a great patch. I can help with the review and commit after
the JIRA is created.

Thanks!

Joel


On Fri, Oct 11, 2019 at 1:06 AM Peter Davie <
peter.da...@convergentsolutions.com.au> wrote:

> Hi,
>
> I apologise in advance for the length of this email, but I want to share
> my discovery steps to make sure that I haven't missed anything during my
> investigation...
>
> I am working on a classification project and will be using the
> classify(model()) stream function to classify documents.  I have noticed
> that models generated include many noise terms from the (lexically)
> early part of the term list.  To test, I have used the BBC articles
> fulltext and category dataset from Kaggle
> (https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have
> indexed the data into a Solr collection (news_categories) and am
> performing the following operation to generate a model for documents
> categorised as "BUSINESS" (keeping only the 100th iteration):
>
> having(
>      train(
>          news_categories,
>          features(
>              news_categories,
>              zkHost="localhost:9983",
>              q="*:*",
>              fq="role:train",
>              fq="category:BUSINESS",
>              featureSet="business",
>              field="body",
>              outcome="positive",
>              numTerms=500
>          ),
>          fq="role:train",
>          fq="category:BUSINESS",
>          zkHost="localhost:9983",
>          name="business_model",
>          field="body",
>          outcome="positive",
>          maxIterations=100
>      ),
>      eq(iteration_i, 100)
> )
>
> The output generated includes "noise" terms such as "1,011.15",
> "10.3m", "01", "02", "03", "10.50", "04", "05", "06", "07", and "09",
> and these terms all have the same value for idfs_ds ("-Infinity").
>
> Investigating the features() output, it appears that the noise terms
> are being returned with NaN in the score_f field:
>
>      "docs": [
>        {
>          "featureSet_s": "business",
>          "score_f": "NaN",
>          "term_s": "1,011.15",
>          "idf_d": "-Infinity",
>          "index_i": 1,
>          "id": "business_1"
>        },
>        {
>          "featureSet_s": "business",
>          "score_f": "NaN",
>          "term_s": "10.3m",
>          "idf_d": "-Infinity",
>          "index_i": 2,
>          "id": "business_2"
>        },
>        {
>          "featureSet_s": "business",
>          "score_f": "NaN",
>          "term_s": "01",
>          "idf_d": "-Infinity",
>          "index_i": 3,
>          "id": "business_3"
>        },
>        {
>          "featureSet_s": "business",
>          "score_f": "NaN",
>          "term_s": "02",
>          "idf_d": "-Infinity",
>          "index_i": 4,
>          "id": "business_4"
>        },...
>
> I have examined the code within
> org/apache/solr/client/solrj/io/stream/FeaturesSelectionStream.java and
> see that the scores being returned by {!igain} include NaN values, as
> follows:
>
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":20,
>      "params":{
>        "q":"*:*",
>        "distrib":"false",
>        "positiveLabel":"1",
>        "field":"body",
>        "numTerms":"300",
>        "fq":["category:BUSINESS",
>          "role:train",
>          "{!igain}"],
>        "version":"2",
>        "wt":"json",
>        "outcome":"positive",
>        "_":"1569982496170"}},
>    "featuredTerms":[
>      "0","NaN",
>      "0.0051","NaN",
>      "0.01","NaN",
>      "0.02","NaN",
>      "0.03","NaN",
>
> Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it
> seems that when a term appears in neither the positive nor the negative
> documents, the docFreq calculation (docFreq = xc + nc) is 0, so the
> subsequent calculations reduce to 0/0, which yields NaN and generates
> these meaningless values for the computed score.
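>
> To make the failure mode concrete, here is a minimal, self-contained
> sketch of an information-gain calculation with the guard I added.  The
> names xc and nc match the plugin; the rest (entropy helpers, counts)
> is my paraphrase for illustration, not the actual Solr source:
>
>      import java.util.LinkedHashMap;
>      import java.util.Map;
>
>      public class IGainSketch {
>          // Entropy of a binary outcome with probability p.
>          static double binaryEntropy(double p) {
>              if (p == 0.0 || p == 1.0) return 0.0;
>              return -(p * log2(p) + (1 - p) * log2(1 - p));
>          }
>
>          static double log2(double x) { return Math.log(x) / Math.log(2); }
>
>          public static void main(String[] args) {
>              int numDocs = 100, positiveDocs = 40;
>              // xc = positive docs containing the term, nc = negative docs.
>              Map<String, int[]> terms = new LinkedHashMap<>();
>              terms.put("deficit", new int[]{12, 3});
>              terms.put("1,011.15", new int[]{0, 0}); // seen in neither class
>
>              double entropyC = binaryEntropy(positiveDocs / (double) numDocs);
>              for (Map.Entry<String, int[]> e : terms.entrySet()) {
>                  int xc = e.getValue()[0], nc = e.getValue()[1];
>                  int docFreq = xc + nc;
>                  if (docFreq == 0) {
>                      continue; // the guard: without it, xc / (double) docFreq
>                                // below is 0/0, which is NaN
>                  }
>                  double pt = docFreq / (double) numDocs;
>                  double entropyT = binaryEntropy(xc / (double) docFreq);
>                  double entropyNotT = binaryEntropy(
>                          (positiveDocs - xc) / (double) (numDocs - docFreq));
>                  double gain = entropyC
>                          - (pt * entropyT + (1 - pt) * entropyNotT);
>                  System.out.println(e.getKey() + " -> " + gain);
>              }
>          }
>      }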
>
> I have patched a local version of Solr along those lines, skipping
> terms for which docFreq is 0 in the finish() method of
> IGainTermsQParserPlugin, and this is now the result:
>
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":260,
>      "params":{
>        "q":"*:*",
>        "distrib":"false",
>        "positiveLabel":"1",
>        "field":"body",
>        "numTerms":"300",
>        "fq":["category:BUSINESS",
>          "role:train",
>          "{!igain}"],
>        "version":"2",
>        "wt":"json",
>        "outcome":"positive",
>        "_":"1569983546342"}},
>    "featuredTerms":[
>      "3",-0.0173133558644304,
>      "authority",-0.0173133558644304,
>      "brand",-0.0173133558644304,
>      "commission",-0.0173133558644304,
>      "compared",-0.0173133558644304,
>      "condition",-0.0173133558644304,
>      "continuing",-0.0173133558644304,
>      "deficit",-0.0173133558644304,
>      "expectation",-0.0173133558644304,
>
> To my (admittedly inexpert) eye, it seems like this is producing more
> reasonable results.
>
> With this change in place, train() now produces:
>
>      "idfs_ds": [
>            0.6212826193303013,
>            0.6434237452075148,
>            0.7169578292536639,
>            0.741349282377823,
>            0.86843471069652,
>            1.0140549006400466,
>            1.0639267306802198,
>            1.0753554265038423,...
>
> |"terms_ss": [ "รข", "company", "market", "firm", "month", "analyst",
> "chief", "time",|||...| I am not sure if I have missed anything, but this
> seems like it's
> producing better outcomes. I would appreciate any input on whether I
> have missed anything here before I proceed further (JIRA and submit a
> patch). Kind Regards, Peter |
>
>
