Hi,

I apologise in advance for the length of this email, but I want to walk through my investigation step by step to make sure I haven't missed anything.

I am working on a classification project and will be using the classify(model()) stream function to classify documents. I have noticed that the generated models include many noise terms from the (lexically) early part of the term list. To test, I have used the BBC articles fulltext and category dataset from Kaggle (https://www.kaggle.com/yufengdev/bbc-fulltext-and-category) and indexed it into a Solr collection (news_categories).
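Each article is indexed with a body text field, a category field, and a role field used to split the documents into train/test sets. For reference, the indexing step with SolrJ looks roughly like the following (a sketch, not my exact loader; the field names match the parameters in the expressions below):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexExample {
      public static void main(String[] args) throws Exception {
        // Sketch: index one BBC article into the news_categories collection.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "business_0001");          // example id
          doc.addField("body", "Quarterly profits at US media giant ...");
          doc.addField("category", "BUSINESS");
          doc.addField("role", "train");                // or "test" for held-out docs
          client.add("news_categories", doc);
          client.commit("news_categories");
        }
      }
    }

To generate a model for documents categorised as "BUSINESS" (keeping only the 100th training iteration), I am performing the following operation: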

having(
    train(
        news_categories,
        features(
            news_categories,
            zkHost="localhost:9983",
            q="*:*",
            fq="role:train",
            fq="category:BUSINESS",
            featureSet="business",
            field="body",
            outcome="positive",
            numTerms=500
        ),
        fq="role:train",
        fq="category:BUSINESS",
        zkHost="localhost:9983",
        name="business_model",
        field="body",
        outcome="positive",
        maxIterations=100
    ),
    eq(iteration_i, 100)
)
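(The having(..., eq(iteration_i, 100)) wrapper keeps only the tuple for the final iteration, since train() emits one model tuple per iteration.) I am sending the expression to the /stream handler; with SolrJ that is roughly the following (again just a sketch, with the full expression above pasted into the expr string):

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.SolrStream;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class TrainExample {
      public static void main(String[] args) throws Exception {
        // Placeholder: substitute the complete having(train(...)) expression above.
        String expr = "having(train(news_categories, features(...), ...), eq(iteration_i, 100))";
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("expr", expr);
        params.set("qt", "/stream");
        SolrStream stream = new SolrStream("http://localhost:8983/solr/news_categories", params);
        try {
          stream.open();
          for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
            System.out.println(tuple.getStrings("terms_ss"));   // selected model terms
          }
        } finally {
          stream.close();
        }
      }
    }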

The generated output includes "noise" terms such as the following: "1,011.15", "10.3m", "01", "02", "03", "10.50", "04", "05", "06", "07", "09", all of which have the same idfs_ds value ("-Infinity").

Investigating the "features()" output, it seems that the issue is that the noise terms are being returned with NaN for the score_f field:

    "docs": [
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "1,011.15",
        "idf_d": "-Infinity",
        "index_i": 1,
        "id": "business_1"
      },
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "10.3m",
        "idf_d": "-Infinity",
        "index_i": 2,
        "id": "business_2"
      },
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "01",
        "idf_d": "-Infinity",
        "index_i": 3,
        "id": "business_3"
      },
      {
        "featureSet_s": "business",
        "score_f": "NaN",
        "term_s": "02",
        "idf_d": "-Infinity",
        "index_i": 4,
        "id": "business_4"
      },...

I have examined the code in org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and can see that the scores returned by the {!igain} query include NaN values, as follows:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":20,
    "params":{
      "q":"*:*",
      "distrib":"false",
      "positiveLabel":"1",
      "field":"body",
      "numTerms":"300",
      "fq":["category:BUSINESS",
        "role:train",
        "{!igain}"],
      "version":"2",
      "wt":"json",
      "outcome":"positive",
      "_":"1569982496170"}},
  "featuredTerms":[
    "0","NaN",
    "0.0051","NaN",
    "0.01","NaN",
    "0.02","NaN",
    "0.03","NaN",

Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it seems that when a term is not included in either the positive or the negative documents, the docFreq calculation (docFreq = xc + nc) is 0, so the subsequent calculations divide by zero and produce NaN, which yields these meaningless values for the computed score.
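A minimal sketch of the failure mode (xc/nc/docFreq mirror the names in the plugin; the entropy expression here is illustrative rather than the exact formula):

    // xc = number of positive training docs containing the term,
    // nc = number of negative training docs containing the term.
    int xc = 0;
    int nc = 0;
    double docFreq = xc + nc;       // 0 when the term appears in neither set
    double p = xc / docFreq;        // 0.0 / 0.0 -> NaN
    double h = p * Math.log(p);     // NaN propagates through the entropy maths
    System.out.println(h);          // NaN, which ends up in score_f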

I have patched a local version of Solr to skip terms for which docFreq is 0 in the finish() method of IGainTermsQParserPlugin.
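The guard is essentially the following (simplified; it sits inside the per-term loop in finish()):

    // Skip terms that occur in neither the positive nor the negative
    // training documents; their score would otherwise be a division by zero.
    double docFreq = xc + nc;
    if (docFreq == 0) {
      continue;   // don't emit a NaN score for this term
    }

This is now the result of the same {!igain} request: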

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":260,
    "params":{
      "q":"*:*",
      "distrib":"false",
      "positiveLabel":"1",
      "field":"body",
      "numTerms":"300",
      "fq":["category:BUSINESS",
        "role:train",
        "{!igain}"],
      "version":"2",
      "wt":"json",
      "outcome":"positive",
      "_":"1569983546342"}},
  "featuredTerms":[
    "3",-0.0173133558644304,
    "authority",-0.0173133558644304,
    "brand",-0.0173133558644304,
    "commission",-0.0173133558644304,
    "compared",-0.0173133558644304,
    "condition",-0.0173133558644304,
    "continuing",-0.0173133558644304,
    "deficit",-0.0173133558644304,
    "expectation",-0.0173133558644304,

To my (admittedly inexpert) eye, it seems like this is producing more reasonable results.

With this change in place, train() now produces:

    "idfs_ds": [
          0.6212826193303013,
          0.6434237452075148,
          0.7169578292536639,
          0.741349282377823,
          0.86843471069652,
          1.0140549006400466,
          1.0639267306802198,
          1.0753554265038423,...

|"terms_ss": [ "â", "company", "market", "firm", "month", "analyst", "chief", "time",|||...| I am not sure if I have missed anything, but this seems like it's producing better outcomes. I would appreciate any input on whether I have missed anything here before I proceed further (JIRA and submit a patch). Kind Regards, Peter |
