This sounds like a great patch. I can help with the review and commit once
the JIRA is created.
Thanks!
Joel
On Fri, Oct 11, 2019 at 1:06 AM Peter Davie <
peter.da...@convergentsolutions.com.au> wrote:
> Hi,
>
> I apologise in advance for the length of this email, but I want to share
> my discovery steps to make sure that I haven't missed anything during my
> investigation...
>
> I am working on a classification project and will be using the
> classify(model()) stream function to classify documents. I have noticed
> that models generated include many noise terms from the (lexically)
> early part of the term list. To test, I have used the "BBC articles
> fulltext and category" dataset from Kaggle
> (https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have
> indexed the data into a Solr collection (news_categories) and am
> performing the following operation to generate a model for documents
> categorised as "BUSINESS" (only keeping the 100th iteration):
>
> having(
>   train(
>     news_categories,
>     features(
>       news_categories,
>       zkHost="localhost:9983",
>       q="*:*",
>       fq="role:train",
>       fq="category:BUSINESS",
>       featureSet="business",
>       field="body",
>       outcome="positive",
>       numTerms=500
>     ),
>     fq="role:train",
>     fq="category:BUSINESS",
>     zkHost="localhost:9983",
>     name="business_model",
>     field="body",
>     outcome="positive",
>     maxIterations=100
>   ),
>   eq(iteration_i, 100)
> )
>
> The output generated includes "noise" terms such as "1,011.15", "10.3m",
> "01", "02", "03", "10.50", "04", "05", "06", "07" and "09", all of which
> have the same idfs_ds value ("-Infinity").
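>
> The "-Infinity" values are consistent with a log being taken of a zero
> count somewhere in the idf computation: in Java, log(0) quietly yields
> negative infinity rather than throwing. A minimal sketch (the idf formula
> and corpus size here are illustrative, not the exact ones Solr uses):

```java
public class LogZeroDemo {
    public static void main(String[] args) {
        // A docFreq of 0 turns any log-of-ratio idf into -Infinity:
        double docFreq = 0.0;
        double numDocs = 510.0; // illustrative corpus size
        double idf = Math.log(docFreq / numDocs); // log(0.0)
        System.out.println(idf); // prints -Infinity; no exception is thrown
    }
}
```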
>
> Investigating the "features()" output, it seems that the issue is that
> the noise terms are being returned with NaN for the score_f field:
>
> "docs": [
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "1,011.15",
>     "idf_d": "-Infinity",
>     "index_i": 1,
>     "id": "business_1"
>   },
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "10.3m",
>     "idf_d": "-Infinity",
>     "index_i": 2,
>     "id": "business_2"
>   },
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "01",
>     "idf_d": "-Infinity",
>     "index_i": 3,
>     "id": "business_3"
>   },
>   {
>     "featureSet_s": "business",
>     "score_f": "NaN",
>     "term_s": "02",
>     "idf_d": "-Infinity",
>     "index_i": 4,
>     "id": "business_4"
>   },...
>
> I have examined the code in
> org/apache/solr/client/solrj/io/stream/FeatureSelectionStream.java and
> see that the scores being returned by {!igain} include NaN values, as
> follows:
>
> {
>"responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":20,
> "params":{
>"q":"*:*",
>"distrib":"false",
>"positiveLabel":"1",
>"field":"body",
>"numTerms":"300",
>"fq":["category:BUSINESS",
> "role:train",
> "{!igain}"],
>"version":"2",
>"wt":"json",
>"outcome":"positive",
>"_":"1569982496170"}},
>"featuredTerms":[
> "0","NaN",
> "0.0051","NaN",
> "0.01","NaN",
> "0.02","NaN",
> "0.03","NaN",
>
> Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it
> seems that when a term appears in neither the positive nor the negative
> documents, the docFreq calculation (docFreq = xc + nc) is 0, so the
> subsequent calculations divide by zero and produce NaN, which generates
> these meaningless values for the computed score.
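>
> To see why a zero docFreq poisons the score rather than failing loudly,
> note that floating-point 0/0 is NaN in Java, and NaN then propagates
> through every later operation. A quick sketch (variable names follow the
> email, not the actual plugin source):

```java
public class IGainNaNDemo {
    public static void main(String[] args) {
        int xc = 0; // positive-class docs containing the term
        int nc = 0; // negative-class docs containing the term
        int docFreq = xc + nc;

        // 0.0 / 0 is NaN, not an ArithmeticException
        double p = (double) xc / docFreq;
        System.out.println(p); // prints NaN

        // and NaN propagates through every subsequent calculation
        double entropyTerm = p * Math.log(p);
        System.out.println(entropyTerm); // still NaN
    }
}
```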
>
> I have patched a local version of Solr so that the finish() method of
> IGainTermsQParserPlugin skips terms whose docFreq is 0, and this is now
> the result:
>
> {
>"responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":260,
> "params":{
>"q":"*:*",
>"distrib":"false",
>"positiveLabel":"1",
>"field":"body",
>"numTerms":"300",
>"fq":["category:BUSINESS",
> "role:train",
> "{!igain}"],
>"version":"2",
>"wt":"json",
>"outcome":"positive",
>"_":"1569983546342"}},
>"featuredTerms":[
> "3",-0.0173133558644304,
> "authority",-0.0173133558644304,
> "brand",-0.0173133558644304,
> "commission",-0.0173133558644304,
> "compared",-0.0173133558644304,
> "condition",-0.0173133558644304,
> "continuing",-0.0173133558644304,
> "deficit",-0.0173133558644304,
> "expectation",-0.0173133558644304,
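>
> A guard of roughly the following shape is enough to drop the
> zero-docFreq terms before they are scored (a self-contained sketch with
> illustrative names and counts, not the literal finish() code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkipZeroDocFreqSketch {
    public static void main(String[] args) {
        // term -> {xc, nc}: counts in positive and negative training docs
        Map<String, int[]> termCounts = new LinkedHashMap<>();
        termCounts.put("deficit", new int[]{12, 3});  // genuine feature
        termCounts.put("1,011.15", new int[]{0, 0});  // noise term

        for (Map.Entry<String, int[]> e : termCounts.entrySet()) {
            int docFreq = e.getValue()[0] + e.getValue()[1];
            if (docFreq == 0) {
                continue; // skip: any score would be 0/0 = NaN
            }
            double p = (double) e.getValue()[0] / docFreq;
            System.out.println(e.getKey() + " -> " + p);
        }
    }
}
```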
>
> To my (admittedly inexpert) eye, it seems like this is producing more