Hi Yongyao,
The code in question is here:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L230-L232
A few things come to mind here...
 * Are you sure the entries with a score below the minimum threshold
were not already present before you set the threshold in your configuration?
 * Have you rebuilt Nutch from source since changing the configuration,
so that the new value is actually available to your Nutch deployment?
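
To make the first point concrete, here is a minimal sketch of a threshold
check of this kind. It is not the Generator code itself: the class and method
names are hypothetical, and it assumes that entries scoring strictly below the
threshold are skipped and that a NaN threshold means "not configured". The
point is that such a check only decides what is selected in a given generate
pass; it does nothing to entries that were selected and fetched earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual Nutch Generator code.
class MinScoreFilterSketch {

  // Returns the URLs whose score is at or above minScore.
  // Assumption: a NaN threshold means "no threshold configured".
  static List<String> select(Map<String, Float> scores, float minScore) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Float> e : scores.entrySet()) {
      if (!Float.isNaN(minScore) && e.getValue() < minScore) {
        continue; // below the threshold: skipped in this generate pass
      }
      selected.add(e.getKey());
    }
    return selected;
  }

  public static void main(String[] args) {
    Map<String, Float> scores = new LinkedHashMap<>();
    scores.put("http://example.org/a", 0.019f); // made-up URLs and scores
    scores.put("http://example.org/b", 0.05f);
    scores.put("http://example.org/c", 0.30f);
    System.out.println(select(scores, 0.05f));
  }
}
```

If your index still contains pages that were generated and fetched before the
threshold took effect, their low scores would show up in the aggregation
regardless of the current setting.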
Lewis

On Mon, Apr 17, 2017 at 3:31 PM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Yongyao Jiang <j.yongya...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 17 Apr 2017 18:31:05 -0400
> Subject: Why "generate.min.score" does not work?
> Hi,
>
> I am using the scoring-similarity plugin. After setting generate.min.score
> to 0.05 and indexing all the pages (with their scores) into Elasticsearch,
> I can still observe many web pages whose scores are below 0.05.
>
> <property>
>   <name>generate.min.score</name>
>   <value>0.05</value>
>   <description>Select only entries with a score larger than
>   generate.min.score.</description>
> </property>
>
> Below is the result of a simple aggregation of "score" in ES:
>   {
>     "key": "20170417215917",
>     "doc_count": 200,
>     "Stats": {
>       "count": 200,
>       "min": 0,
>       "max": 0.019184709,
>       "avg": 0.0012828724450000002,
>       "sum": 0.256574489
>     }
>   }
>
> Thanks,
> Yongyao
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
