Re: igain query parser generating invalid output

2019-10-12 Thread Peter Davie

Hi,

I have created the bug report in Jira and attached the patch to it.

Kind Regards,
Peter

On 12/10/2019 2:34 am, Joel Bernstein wrote:

This sounds like a great patch. I can help with the review and commit after
the jira is created.

Thanks!

Joel



Re: igain query parser generating invalid output

2019-10-11 Thread Joel Bernstein
This sounds like a great patch. I can help with the review and commit after
the jira is created.

Thanks!

Joel



igain query parser generating invalid output

2019-10-10 Thread Peter Davie

Hi,

I apologise in advance for the length of this email, but I want to share 
my discovery steps to make sure that I haven't missed anything during my 
investigation...


I am working on a classification project and will be using the 
classify(model()) stream function to classify documents.  I have noticed 
that models generated include many noise terms from the (lexically) 
early part of the term list.  To test, I have used the BBC articles 
fulltext and category dataset from Kaggle 
(https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have 
indexed the data into a Solr collection (news_categories) and am 
performing the following operation to generate a model for documents 
categorised as "BUSINESS" (only keeping the 100th iteration):


having(
    train(
        news_categories,
        features(
            news_categories,
            zkHost="localhost:9983",
            q="*:*",
            fq="role:train",
            fq="category:BUSINESS",
            featureSet="business",
            field="body",
            outcome="positive",
            numTerms=500
        ),
        fq="role:train",
        fq="category:BUSINESS",
        zkHost="localhost:9983",
        name="business_model",
        field="body",
        outcome="positive",
        maxIterations=100
    ),
    eq(iteration_i, 100)
)

The output generated includes "noise" terms such as the following: 
"1,011.15", "10.3m", "01", "02", "03", "10.50", "04", "05", "06", "07", 
and "09", and these terms all have the same value for idfs_ds ("-Infinity").

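(As an aside, both sentinel values are plain IEEE-754 double arithmetic 
surfacing in the serialized output once a term's combined document 
frequency is zero; a tiny illustration with made-up counts, not Solr code:)

public class ZeroDocFreqDemo {
    public static void main(String[] args) {
        int xc = 0;             // hypothetical count: term in no positive documents
        int nc = 0;             // hypothetical count: term in no negative documents
        int docFreq = xc + nc;  // 0 for a term absent from both sets

        // 0.0 / 0.0 is NaN, so any ratio over docFreq poisons the score...
        System.out.println((double) xc / docFreq);              // NaN

        // ...and log(0) is -Infinity, matching the idf values seen above
        // (1000 is a made-up corpus size for illustration).
        System.out.println(Math.log((double) docFreq / 1000));  // -Infinity
    }
}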

Investigating the "features()" output, it seems that the issue is that 
the noise terms are being returned with NaN for the score_f field:

  "docs": [
    {
      "featureSet_s": "business",
      "score_f": "NaN",
      "term_s": "1,011.15",
      "idf_d": "-Infinity",
      "index_i": 1,
      "id": "business_1"
    },
    {
      "featureSet_s": "business",
      "score_f": "NaN",
      "term_s": "10.3m",
      "idf_d": "-Infinity",
      "index_i": 2,
      "id": "business_2"
    },
    {
      "featureSet_s": "business",
      "score_f": "NaN",
      "term_s": "01",
      "idf_d": "-Infinity",
      "index_i": 3,
      "id": "business_3"
    },
    {
      "featureSet_s": "business",
      "score_f": "NaN",
      "term_s": "02",
      "idf_d": "-Infinity",
      "index_i": 4,
      "id": "business_4"
    },...

I have examined the code within 
org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and 
see that the scores being returned by {!igain} include NaN values, as 
follows:


{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":20,
    "params":{
      "q":"*:*",
      "distrib":"false",
      "positiveLabel":"1",
      "field":"body",
      "numTerms":"300",
      "fq":["category:BUSINESS",
        "role:train",
        "{!igain}"],
      "version":"2",
      "wt":"json",
      "outcome":"positive",
      "_":"1569982496170"}},
  "featuredTerms":[
    "0","NaN",
    "0.0051","NaN",
    "0.01","NaN",
    "0.02","NaN",
    "0.03","NaN",

Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it 
seems that when a term is not included in either the positive or the 
negative documents, the docFreq calculation (docFreq = xc + nc) is 0, so 
the subsequent calculations divide by zero and produce NaN, which 
generates these meaningless values for the computed score.
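
To make the failure mode concrete, here is a minimal, self-contained 
sketch of the calculation and the guard (not the actual Solr source; the 
names xc, nc, and docFreq follow the description above, and the scoring 
formula is a stand-in for the real information-gain math):

// A minimal sketch of the failure and the guard (not the actual Solr
// source): xc is a term's document frequency within the positive documents
// and nc within the negative documents, following the description above.
public class IGainDocFreqGuardSketch {

    static void score(String term, int xc, int nc, int numDocs) {
        int docFreq = xc + nc;

        // The guard: a term that appears in neither the positive nor the
        // negative documents has docFreq == 0, and the division below would
        // turn its score into NaN, so skip it instead of emitting a score.
        if (docFreq == 0) {
            System.out.println(term + " -> skipped (docFreq == 0)");
            return;
        }

        double pPositive = (double) xc / docFreq;  // well-defined once docFreq > 0
        double score = pPositive * Math.log((double) docFreq / numDocs);
        System.out.println(term + " -> " + score);
    }

    public static void main(String[] args) {
        score("deficit", 12, 3, 100);  // hypothetical counts: appears in both sets
        score("1,011.15", 0, 0, 100);  // matches nothing; previously scored NaN
    }
}

Skipping the term, rather than emitting a sentinel score, also keeps these 
zero-frequency entries from consuming numTerms slots in the feature list.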


I have patched a local version of Solr to skip terms for which docFreq 
is 0 in the finish() method of IGainTermsQParserPlugin, and this is now 
the result:


{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":260,
    "params":{
      "q":"*:*",
      "distrib":"false",
      "positiveLabel":"1",
      "field":"body",
      "numTerms":"300",
      "fq":["category:BUSINESS",
        "role:train",
        "{!igain}"],
      "version":"2",
      "wt":"json",
      "outcome":"positive",
      "_":"1569983546342"}},
  "featuredTerms":[
    "3",-0.0173133558644304,
    "authority",-0.0173133558644304,
    "brand",-0.0173133558644304,
    "commission",-0.0173133558644304,
    "compared",-0.0173133558644304,
    "condition",-0.0173133558644304,
    "continuing",-0.0173133558644304,
    "deficit",-0.0173133558644304,
    "expectation",-0.0173133558644304,

To my (admittedly inexpert) eye, it seems like this is producing more 
reasonable results.


With this change in place, train() now produces:

  "idfs_ds": [
    0.6212826193303013,
    0.6434237452075148,
    0.7169578292536639,
    0.741349282377823,
    0.86843471069652,
    1.0140549006400466,
    1.0639267306802198,
    1.0753554265038423,...

  "terms_ss": [
    "â",
    "company",
    "market",
    "firm",
    "month",
    "analyst",
    "chief",
    "time",...

I am not sure if I have missed anything, but this seems like it's 
producing better outcomes. I would appreciate any input on whether I 
have missed anything.