Hi Sebastian,

Yes, I understand. But people who use the scoring-similarity plugin usually
intend to do domain-specific crawling, which means they want to control
what the crawler fetches by adjusting generate.min.score.

I just figured out why adjusting the minimum value does not change the
results: the "sort" variable in the code linked below always equals 1.0
when using the scoring-similarity plugin, because this plugin doesn't
implement the "generatorSortValue()" function.
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
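
To illustrate the effect, here is a minimal, self-contained sketch. It is not
the actual Nutch code: SortFilter, PASS_THROUGH, SCORE_AWARE and select() are
simplified stand-ins made up for this example, and the "score times initial
sort value" override is only one assumed way a plugin could feed the CrawlDb
score into the sort value.

```java
import java.util.ArrayList;
import java.util.List;

public class SortValueSketch {

  /** Simplified stand-in for a scoring filter; not the real Nutch interface. */
  interface SortFilter {
    float generatorSortValue(float datumScore, float initSort);
  }

  /** Behaves like a filter that does not override the method:
   *  the initial sort value is passed through unchanged. */
  static final SortFilter PASS_THROUGH = (datumScore, initSort) -> initSort;

  /** Hypothetical override that folds the CrawlDb score into the sort value. */
  static final SortFilter SCORE_AWARE = (datumScore, initSort) -> datumScore * initSort;

  /** Keeps only the scores whose sort value reaches the minimum score. */
  static List<Float> select(float[] scores, float minScore, SortFilter filter) {
    List<Float> selected = new ArrayList<>();
    for (float score : scores) {
      // The generator starts from a sort value of 1.0 and lets the filter adjust it.
      float sort = filter.generatorSortValue(score, 1.0f);
      if (sort >= minScore) {
        selected.add(score);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    float[] scores = {0.0f, 0.019f, 0.06f, 0.3f};
    // Pass-through: every sort value is 1.0, so the threshold filters nothing.
    System.out.println(select(scores, 0.05f, PASS_THROUGH).size()); // prints 4
    // Score-aware: only entries scoring at least 0.05 are kept.
    System.out.println(select(scores, 0.05f, SCORE_AWARE).size());  // prints 2
  }
}
```

With the pass-through filter, no value of the threshold at or below 1.0 can
exclude anything, which matches what I observe with generate.min.score.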

I think this is a bug. Please correct me if I am wrong. I can also submit
a PR if needed.

Thanks,
Yongyao

On Tue, Apr 18, 2017 at 1:13 PM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> the scores in the index are not relevant for generating, only the scores
> in CrawlDb.
> The ScoringFilter interface defines a method indexerScore(...), and some
> scoring filters return a modified (normalized) indexer score
> (cf. indexer.score.power).
> Also, changes to generate.min.score affect only which pages are fetched;
> pages fetched before may have a lower score.
> The score may also change when a page is processed (parsed, etc.) or even
> afterwards (by links pointing to it).
>
> In short: generate.min.score determines what is crawled, not what is
> indexed.
>
> Best,
> Sebastian
>
> On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
> > Hi,
> >
> > I am using the scoring-similarity plugin. After setting
> > generate.min.score to 0.05 and indexing all the pages (with their
> > scores) into Elastic, I can still observe many web pages whose scores
> > are below 0.05.
> >
> > <property>
> >   <name>generate.min.score</name>
> >   <value>0.05</value>
> >   <description>Select only entries with a score larger than
> >   generate.min.score.</description>
> > </property>
> >
> > Below is the result of a simple aggregation of "score" in ES,
> >         {
> >                "key": "20170417215917",
> >                "doc_count": 200,
> >                "Stats": {
> >                   "count": 200,
> >                   "min": 0,
> >                   "max": 0.019184709,
> >                   "avg": 0.0012828724450000002,
> >                   "sum": 0.256574489
> >                }
> >             }
> >
> > Thanks,
> > Yongyao
> >
>
>


-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
