Hi Yongyao,

The code in question is found below:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L230-L232

A few things come to mind here...

* Are you sure the entries with a lower score than the minimum threshold were not present before you established the threshold configuration?
* Have you rebuilt the Nutch source code after establishing the configuration, such that the desired configuration is available to the Nutch deployment?

Lewis

On Mon, Apr 17, 2017 at 3:31 PM, <user-digest-h...@nutch.apache.org> wrote:
>
> From: Yongyao Jiang <j.yongya...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 17 Apr 2017 18:31:05 -0400
> Subject: Why "generate.min.score" does not work?
> Hi,
>
> I am using scoring-similarity plugin. After setting the generate.min.score
> to 0.05, and indexing all the pages (with its score) into Elastic, I can
> still observe many web pages whose scores are below 0.05.
>
> <property>
>   <name>generate.min.score</name>
>   <value>0.05</value>
>   <description>Select only entries with a score larger than
>   generate.min.score.</description>
> </property>
>
> Below is the result of a simple aggregation of "score" in ES,
> {
>   "key": "20170417215917",
>   "doc_count": 200,
>   "Stats": {
>     "count": 200,
>     "min": 0,
>     "max": 0.019184709,
>     "avg": 0.0012828724450000002,
>     "sum": 0.256574489
>   }
> }
>
> Thanks,
> Yongyao

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney