Hi,

I apologise in advance for the length of this email, but I want to walk through my investigation step by step to make sure I haven't missed anything.

I am working on a classification project and will be using the classify(model()) stream function to classify documents. I have noticed that the generated models include many noise terms from the (lexically) early part of the term list. To test, I have used the BBC articles fulltext and category dataset from Kaggle (https://www.kaggle.com/yufengdev/bbc-fulltext-and-category) and indexed it into a Solr collection (news_categories).
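Each article is indexed with a body text field, a category field, and a role field used to split the documents into train/test sets. For reference, the indexing step with SolrJ looks roughly like the following (a sketch, not my exact loader; the field names match the parameters in the expressions below):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexExample {
      public static void main(String[] args) throws Exception {
        // Sketch: index one BBC article into the news_categories collection.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "business_0001");          // example id
          doc.addField("body", "Quarterly profits at US media giant ...");
          doc.addField("category", "BUSINESS");
          doc.addField("role", "train");                // or "test" for held-out docs
          client.add("news_categories", doc);
          client.commit("news_categories");
        }
      }
    }

To generate a model for documents categorised as "BUSINESS" (keeping only the 100th training iteration), I am performing the following operation: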

having(
    train(
        news_categories,
        features(
            news_categories,
            zkHost="localhost:9983",
            q="*:*",
            fq="role:train",
            fq="category:BUSINESS",
            featureSet="business",
            field="body",
            outcome="positive",
            numTerms=500
        ),
        fq="role:train",
        fq="category:BUSINESS",
        zkHost="localhost:9983",
        name="business_model",
        field="body",
        outcome="positive",
        maxIterations=100
    ),
    eq(iteration_i, 100)
)
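(The having(..., eq(iteration_i, 100)) wrapper keeps only the tuple for the final iteration, since train() emits one model tuple per iteration.) I am sending the expression to the /stream handler; with SolrJ that is roughly the following (again just a sketch, with the full expression above pasted into the expr string):

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.SolrStream;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class TrainExample {
      public static void main(String[] args) throws Exception {
        // Placeholder: substitute the complete having(train(...)) expression above.
        String expr = "having(train(news_categories, features(...), ...), eq(iteration_i, 100))";
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("expr", expr);
        params.set("qt", "/stream");
        SolrStream stream = new SolrStream("http://localhost:8983/solr/news_categories", params);
        try {
          stream.open();
          for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
            System.out.println(tuple.getStrings("terms_ss"));   // selected model terms
          }
        } finally {
          stream.close();
        }
      }
    }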

The generated output includes "noise" terms such as the following: "1,011.15", "10.3m", "01", "02", "03", "10.50", "04", "05", "06", "07", "09", all of which have the same idfs_ds value ("-Infinity").

Investigating the "features()" output, it seems that the issue is that the noise terms are being returned with NaN for the score_f field:

    "docs": [
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "1,011.15",
        "idf_d": "-Infinity",
        "index_i": 1,
        "id": "business_1"
      },
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "10.3m",
        "idf_d": "-Infinity",
        "index_i": 2,
        "id": "business_2"
      },
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "01",
        "idf_d": "-Infinity",
        "index_i": 3,
        "id": "business_3"
      },
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "02",
        "idf_d": "-Infinity",
        "index_i": 4,
        "id": "business_4"
      },...

I have examined the code in org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and can see that the scores returned by the {!igain} query include NaN values, as follows:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":20,
    "params":{
      "q":"*:*",
      "distrib":"false",
      "positiveLabel":"1",
      "field":"body",
      "numTerms":"300",
      "fq":["category:BUSINESS",
        "role:train",
        "{!igain}"],
      "version":"2",
      "wt":"json",
      "outcome":"positive",
      "_":"1569982496170"}},
  "featuredTerms":[
    "0","NaN",
    "0.0051","NaN",
    "0.01","NaN",
    "0.02","NaN",
    "0.03","NaN",

Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it seems that when a term is not included in either the positive or the negative documents, the docFreq calculation (docFreq = xc + nc) is 0, so the subsequent calculations divide by zero and produce NaN, which yields these meaningless values for the computed score.
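A minimal sketch of the failure mode (xc/nc/docFreq mirror the names in the plugin; the entropy expression here is illustrative rather than the exact formula):

    // xc = number of positive training docs containing the term,
    // nc = number of negative training docs containing the term.
    int xc = 0;
    int nc = 0;
    double docFreq = xc + nc;       // 0 when the term appears in neither set
    double p = xc / docFreq;        // 0.0 / 0.0 -> NaN
    double h = p * Math.log(p);     // NaN propagates through the entropy maths
    System.out.println(h);          // NaN, which ends up in score_f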

I have patched a local version of Solr to skip terms for which docFreq is 0 in the finish() method of IGainTermsQParserPlugin.
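The guard is essentially the following (simplified; it sits inside the per-term loop in finish()):

    // Skip terms that occur in neither the positive nor the negative
    // training documents; their score would otherwise be a division by zero.
    double docFreq = xc + nc;
    if (docFreq == 0) {
      continue;   // don't emit a NaN score for this term
    }

This is now the result of the same {!igain} request: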

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":260,
    "params":{
      "q":"*:*",
      "distrib":"false",
      "positiveLabel":"1",
      "field":"body",
      "numTerms":"300",
      "fq":["category:BUSINESS",
        "role:train",
        "{!igain}"],
      "version":"2",
      "wt":"json",
      "outcome":"positive",
      "_":"1569983546342"}},
  "featuredTerms":[
    "3",-0.0173133558644304,
    "authority",-0.0173133558644304,
    "brand",-0.0173133558644304,
    "commission",-0.0173133558644304,
    "compared",-0.0173133558644304,
    "condition",-0.0173133558644304,
    "continuing",-0.0173133558644304,
    "deficit",-0.0173133558644304,
    "expectation",-0.0173133558644304,

To my (admittedly inexpert) eye, it seems like this is producing more reasonable results.

With this change in place, train() now produces:

    "idfs_ds": [
          0.6212826193303013,
          0.6434237452075148,
          0.7169578292536639,
          0.741349282377823,
          0.86843471069652,
          1.0140549006400466,
          1.0639267306802198,
          1.0753554265038423,...

|"terms_ss": [ "â", "company", "market", "firm", "month", "analyst", "chief", "time",|||...| I am not sure if I have missed anything, but this seems like it's producing better outcomes. I would appreciate any input on whether I have missed anything here before I proceed further (JIRA and submit a patch). Kind Regards, Peter |
