Thanks, Sebastian. That makes sense. Just a follow-up question: if I want
to combine the OPIC score and the similarity score, how should I do it?
I may be wrong, but I don't think simply setting
scoring-opic|scoring-similarity will do the trick, since the two scores
could get mixed up, or one could be overwritten by the other at the various
scoring steps. Do I have to store them in two separate attributes and
combine them at the end of indexing (indexerScore())?
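
To make the idea concrete, here is a rough sketch of what I have in mind.
It is not Nutch code: a plain Map stands in for the page metadata, the two
keys and the class name are hypothetical, and multiplying the scores is
only one possible way to combine them.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for per-page metadata.
class CombinedScoreSketch {
    // Hypothetical keys, not defined by Nutch.
    static final String OPIC_KEY = "_opic_score_";
    static final String SIMILARITY_KEY = "_similarity_score_";

    // Combine the two separately stored scores at indexing time.
    // A missing score falls back to the neutral factor 1.0f.
    // Multiplication is an assumption; a weighted sum would also work.
    static float combine(Map<String, Float> meta, float initScore) {
        float opic = meta.getOrDefault(OPIC_KEY, 1.0f);
        float similarity = meta.getOrDefault(SIMILARITY_KEY, 1.0f);
        return initScore * opic * similarity;
    }

    public static void main(String[] args) {
        Map<String, Float> meta = new HashMap<>();
        // Each scoring step writes only to its own key,
        // so neither score overwrites the other.
        meta.put(OPIC_KEY, 0.5f);
        meta.put(SIMILARITY_KEY, 0.2f);
        System.out.println(combine(meta, 1.0f));
    }
}
```

The point is only that each plugin would keep its own value until the very
last step, instead of both writing into the single score field.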

Yongyao

On Sat, Apr 22, 2017 at 1:15 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Yongyao,
>
> yes, that sounds reasonable. A simple
>   return datum.getScore() * initSort;
> would do the job. That should be enough, as the similarity score is
> calculated after parsing and distributed to the outlinks. However,
>   updateDbScore(...)
> also needs to be implemented accordingly. Otherwise the scores from
> outlinks are never aggregated in the CrawlDb, and the similarity score
> is used only for newly found links. The question is whether
> scoring-similarity was in fact designed to be used in combination with
> another scoring plugin (e.g., scoring-opic) which really implements
> these methods.
>
> Please open an issue on Jira to discuss any questions, and so that the
> change is covered in the documentation and the release report. A PR is
> also welcome!
>
> Thanks,
> Sebastian
>
> On 04/18/2017 09:05 PM, Yongyao Jiang wrote:
> > Hi Sebastian,
> >
> > Yes, I understand. But when people use the scoring-similarity plugin,
> > they intend to do domain-specific crawling in most cases. It also means
> > that they want to control how the crawler works by adjusting
> > generate.min.score.
> >
> > I just figured out why adjusting the min value does not change the
> > results: the "sort" variable in the code below always equals 1.0 when
> > using the scoring-similarity plugin, because this plugin doesn't
> > implement the "generatorSortValue()" function.
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
> >
> > I think this is a bug. Please correct me if I am wrong. I
> > can also submit a PR if needed.
> >
> > Thanks,
> > Yongyao
> >
> > On Tue, Apr 18, 2017 at 1:13 PM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> the scores in the index are not relevant for generating, only the
> >> scores in the CrawlDb are.
> >> The ScoringFilter interface defines a method indexerScore(...), and some
> >> scoring filters return a modified (normalized) indexer score
> >> (cf. indexer.score.power).
> >> Also, changes to generate.min.score affect only which pages are fetched;
> >> pages fetched before the change may have a lower score.
> >> The score may also change when a page is processed (parsed, etc.) or even
> >> afterwards (by links pointing to it).
> >>
> >> In short: generate.min.score determines what is crawled, not what is
> >> indexed.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
> >>> Hi,
> >>>
> >>> I am using the scoring-similarity plugin. After setting
> >>> generate.min.score to 0.05 and indexing all the pages (with their
> >>> scores) into Elastic, I can still observe many web pages whose scores
> >>> are below 0.05.
> >>>
> >>> <property>
> >>>   <name>generate.min.score</name>
> >>>   <value>0.05</value>
> >>>   <description>Select only entries with a score larger than
> >>>   generate.min.score.</description>
> >>> </property>
> >>>
> >>> Below is the result of a simple aggregation of "score" in ES:
> >>>         {
> >>>            "key": "20170417215917",
> >>>            "doc_count": 200,
> >>>            "Stats": {
> >>>               "count": 200,
> >>>               "min": 0,
> >>>               "max": 0.019184709,
> >>>               "avg": 0.0012828724450000002,
> >>>               "sum": 0.256574489
> >>>            }
> >>>         }
> >>>
> >>> Thanks,
> >>> Yongyao
> >>>
> >>
> >>
> >
> >
>
>


-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
