This sounds like a great patch. I can help with the review and commit after the JIRA issue is created.
Thanks!

Joel

On Fri, Oct 11, 2019 at 1:06 AM Peter Davie <
peter.da...@convergentsolutions.com.au> wrote:

> Hi,
>
> I apologise in advance for the length of this email, but I want to share
> my discovery steps to make sure that I haven't missed anything during my
> investigation...
>
> I am working on a classification project and will be using the
> classify(model()) stream function to classify documents. I have noticed
> that the models generated include many noise terms from the (lexically)
> early part of the term list. To test, I have used the "BBC articles
> fulltext and category" dataset from Kaggle
> (https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have
> indexed the data into a Solr collection (news_categories) and am
> performing the following operation to generate a model for documents
> categorised as "BUSINESS" (only keeping the 100th iteration):
>
> having(
>   train(
>     news_categories,
>     features(
>       news_categories,
>       zkHost="localhost:9983",
>       q="*:*",
>       fq="role:train",
>       fq="category:BUSINESS",
>       featureSet="business",
>       field="body",
>       outcome="positive",
>       numTerms=500
>     ),
>     fq="role:train",
>     fq="category:BUSINESS",
>     zkHost="localhost:9983",
>     name="business_model",
>     field="body",
>     outcome="positive",
>     maxIterations=100
>   ),
>   eq(iteration_i, 100)
> )
>
> The output generated includes "noise" terms such as "1,011.15", "10.3m",
> "01", "02", "03", "10.50", "04", "05", "06", "07", "09", and these terms
> all have the same value for idfs_ds ("-Infinity").
>
> Investigating the "features()" output, it seems that the issue is that
> the noise terms are being returned with NaN for the score_f field:
>
> "docs": [
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "1,011.15",
>     "idf_d": "-Infinity",
>     "index_i": 1,
>     "id": "business_1"
>   },
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "10.3m",
>     "idf_d": "-Infinity",
>     "index_i": 2,
>     "id": "business_2"
>   },
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "01",
>     "idf_d": "-Infinity",
>     "index_i": 3,
>     "id": "business_3"
>   },
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "02",
>     "idf_d": "-Infinity",
>     "index_i": 4,
>     "id": "business_4"
>   },...
>
> I have examined the code within
> org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and
> see that the scores being returned by {!igain} include NaN values, as
> follows:
>
> {
>   "responseHeader":{
>     "zkConnected":true,
>     "status":0,
>     "QTime":20,
>     "params":{
>       "q":"*:*",
>       "distrib":"false",
>       "positiveLabel":"1",
>       "field":"body",
>       "numTerms":"300",
>       "fq":["category:BUSINESS",
>         "role:train",
>         "{!igain}"],
>       "version":"2",
>       "wt":"json",
>       "outcome":"positive",
>       "_":"1569982496170"}},
>   "featuredTerms":[
>     "0","NaN",
>     "0.0051","NaN",
>     "0.01","NaN",
>     "0.02","NaN",
>     "0.03","NaN",
>
> Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it
> seems that when a term is not included in the positive or negative
> documents, the docFreq calculation (docFreq = xc + nc) is 0, which means
> that subsequent calculations result in NaN (division by 0), which
> generates these meaningless values for the computed score.
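The division-by-zero path described above can be reproduced in isolation. Below is a simplified sketch of an information-gain score using a standard binary-entropy formulation; it is not the actual IGainTermsQParserPlugin code, and all names and the exact formula are illustrative assumptions:

```java
// Simplified sketch of an information-gain score (illustrative only;
// NOT the actual Solr IGainTermsQParserPlugin implementation).
public class IGainNaNDemo {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Binary entropy H(p), with the convention 0 * log(0) = 0.
    static double binaryEntropy(double p) {
        if (p == 0 || p == 1) {
            return 0;
        }
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // xc = positive docs containing the term, nc = negative docs containing it,
    // pos = positive docs in the training set, total = all training docs.
    static double igain(double xc, double nc, double pos, double total) {
        double docFreq = xc + nc;                              // 0 when absent
        double entropyC = binaryEntropy(pos / total);
        double pt = docFreq / total;
        double entropyWithTerm = binaryEntropy(xc / docFreq);  // 0/0 -> NaN
        double entropyWithoutTerm =
                binaryEntropy((pos - xc) / (total - docFreq));
        return entropyC - (pt * entropyWithTerm + (1 - pt) * entropyWithoutTerm);
    }

    public static void main(String[] args) {
        // Term present in neither class: docFreq = 0, so xc / docFreq = 0/0
        // produces NaN, which propagates into the final score.
        System.out.println(igain(0, 0, 100, 200));  // prints NaN
        // An ordinary term stays finite.
        System.out.println(igain(5, 3, 100, 200));
    }
}
```

Once `xc / docFreq` evaluates to NaN, every subsequent arithmetic step (including the `pt *` multiplication, since `0 * NaN` is NaN under IEEE 754) keeps the score NaN, matching the `"score_f": "NaN"` values shown above.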
>
> I have patched a local version of Solr to skip terms for which docFreq
> is 0 in the finish() method of IGainTermsQParserPlugin, and this is now
> the result:
>
> {
>   "responseHeader":{
>     "zkConnected":true,
>     "status":0,
>     "QTime":260,
>     "params":{
>       "q":"*:*",
>       "distrib":"false",
>       "positiveLabel":"1",
>       "field":"body",
>       "numTerms":"300",
>       "fq":["category:BUSINESS",
>         "role:train",
>         "{!igain}"],
>       "version":"2",
>       "wt":"json",
>       "outcome":"positive",
>       "_":"1569983546342"}},
>   "featuredTerms":[
>     "3",-0.0173133558644304,
>     "authority",-0.0173133558644304,
>     "brand",-0.0173133558644304,
>     "commission",-0.0173133558644304,
>     "compared",-0.0173133558644304,
>     "condition",-0.0173133558644304,
>     "continuing",-0.0173133558644304,
>     "deficit",-0.0173133558644304,
>     "expectation",-0.0173133558644304,
>
> To my (admittedly inexpert) eye, it seems like this is producing more
> reasonable results.
>
> With this change in place, train() now produces:
>
> "idfs_ds": [
>   0.6212826193303013,
>   0.6434237452075148,
>   0.7169578292536639,
>   0.741349282377823,
>   0.86843471069652,
>   1.0140549006400466,
>   1.0639267306802198,
>   1.0753554265038423,...
>
> "terms_ss": [ "â", "company", "market", "firm", "month", "analyst",
> "chief", "time",...
>
> I am not sure if I have missed anything, but this seems like it's
> producing better outcomes. I would appreciate any input on whether I
> have missed anything here before I proceed further (JIRA and submit a
> patch).
>
> Kind Regards,
> Peter
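The shape of the fix described in the mail (skipping any term whose docFreq is 0 before it is scored) could be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: the `scoreTerms` helper and the term-to-counts map are invented bookkeeping, and the igain formula is the same simplified formulation as above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the guard described in the mail: skip any term
// whose docFreq (xc + nc) is 0 before scoring, so no NaN score is emitted.
// The data layout and names are invented; this is not the real finish() code.
public class SkipZeroDocFreq {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Binary entropy H(p), with the convention 0 * log(0) = 0.
    static double binaryEntropy(double p) {
        if (p == 0 || p == 1) {
            return 0;
        }
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Simplified information-gain score (same illustrative formula as above).
    static double igain(double xc, double nc, double pos, double total) {
        double docFreq = xc + nc;
        double entropyC = binaryEntropy(pos / total);
        double pt = docFreq / total;
        return entropyC - (pt * binaryEntropy(xc / docFreq)
                + (1 - pt) * binaryEntropy((pos - xc) / (total - docFreq)));
    }

    // counts maps term -> {xc, nc}; terms absent from both classes are skipped.
    static Map<String, Double> scoreTerms(Map<String, double[]> counts,
                                          double pos, double total) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : counts.entrySet()) {
            double xc = e.getValue()[0];
            double nc = e.getValue()[1];
            if (xc + nc == 0) {
                continue;  // the guard: docFreq == 0 would otherwise yield NaN
            }
            scores.put(e.getKey(), igain(xc, nc, pos, total));
        }
        return scores;
    }
}
```

With this guard in place, zero-docFreq noise terms such as "1,011.15" simply never enter the feature list, which matches the cleaner `featuredTerms` output shown above.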