Thanks, Sebastian. That makes sense. Just a follow-up question: if I want to combine the OPIC score and the similarity score, how should I do it? I may be wrong, but I don't think simply setting scoring-opic|scoring-similarity will do the trick, as the two scores could get mixed up, or one could be overwritten by the other at various scoring steps. Do I have to store them in two separate attributes and combine them at the end of indexing (indexerScore())?
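To make the idea concrete, here is roughly what I have in mind. It is a standalone sketch, not an actual Nutch plugin: a plain Map stands in for the per-page metadata, and the two keys, the weight parameter and the linear combination are all my own assumptions rather than anything Nutch defines.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch: keeps the two scores in separate (hypothetical) metadata
// fields so neither overwrites the other, and merges them only at the end.
class CombinedScoreSketch {
    // Hypothetical metadata keys, not defined by Nutch.
    static final String OPIC_KEY = "_opic_score_";
    static final String SIM_KEY = "_similarity_score_";

    // Weighted linear combination; weight is the share given to the
    // similarity score and must lie in [0, 1]. This formula is an assumption.
    static float combine(float opic, float similarity, float weight) {
        if (weight < 0f || weight > 1f) {
            throw new IllegalArgumentException("weight must be in [0, 1]");
        }
        return (1f - weight) * opic + weight * similarity;
    }

    // Stand-in for the final combination step at indexing time:
    // reads both stored scores (missing ones count as 0) and combines them.
    static float indexerScore(Map<String, String> metadata, float weight) {
        float opic = Float.parseFloat(metadata.getOrDefault(OPIC_KEY, "0"));
        float similarity = Float.parseFloat(metadata.getOrDefault(SIM_KEY, "0"));
        return combine(opic, similarity, weight);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(OPIC_KEY, "0.5");   // written by the link-based scorer
        metadata.put(SIM_KEY, "0.25");   // written by the similarity scorer
        System.out.println(indexerScore(metadata, 0.5f));
    }
}
```

In a real plugin the two values would presumably have to be carried through the other scoring steps as well, which is the part I am unsure about.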
Yongyao

On Sat, Apr 22, 2017 at 1:15 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Yongyao,
>
> yes, that sounds reasonable. A simple
>   return datum.getScore() * initSort;
> would do the job. That should be enough, as the similarity score is
> calculated after parsing and distributed to the outlinks. However, also
>   updateDbScore(...)
> needs to be implemented accordingly. Otherwise the scores from outlinks
> are never aggregated in the CrawlDb, only for newly found links the
> similarity score is used. The question is whether scoring-similarity wasn't
> designed to be used in combination with another scoring plugin (e.g.,
> scoring-opic) which really implements these methods.
>
> Please, open an issue on Jira to discuss any questions and for
> documentation and release report, a PR is also welcome!
>
> Thanks,
> Sebastian
>
> On 04/18/2017 09:05 PM, Yongyao Jiang wrote:
> > Hi Sebastian,
> >
> > Yes, I understand. But when people use the similarity-scoring plugin, they
> > intend to do domain-specific crawling in most cases. It also means that
> > they want to control how the crawler works by adjusting the
> > generate.min.score.
> >
> > I just figured out the reason that adjusting the min value does not change
> > the results is that the "sort" variable in the code below always equals 1.0
> > when using the scoring-similarity plugin, because this plugin doesn't
> > implement the "generatorSortValue()" function.
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
> >
> > I think this is supposed to be a bug. Please correct me if I am wrong. I
> > can also submit a PR if needed.
> >
> > Thanks,
> > Yongyao
> >
> > On Tue, Apr 18, 2017 at 1:13 PM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> the scores in the index is not relevant for generating, only the scores
> >> in CrawlDb.
> >> The ScoringFilter interface defines a method indexerScore(...), some
> >> scoring filters return a modified (normalized) indexer score
> >> (cf. indexer.score.power).
> >> Also, changes to generate.min.score affect only which pages are fetched,
> >> pages fetched before may have a lower score.
> >> The score may also change when a page is processed (parsed, etc.) or even
> >> afterwards (by links pointing to it).
> >>
> >> In short: generate.min.score determines what is crawled, not what is
> >> indexed.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
> >>> Hi,
> >>>
> >>> I am using scoring-similarity plugin. After setting the
> >>> generate.min.score to 0.05, and indexing all the pages (with its score)
> >>> into Elastic, I can still observe many web pages whose scores are below
> >>> 0.05.
> >>>
> >>> <property>
> >>>   <name>generate.min.score</name>
> >>>   <value>0.05</value>
> >>>   <description>Select only entries with a score larger than
> >>>   generate.min.score.</description>
> >>> </property>
> >>>
> >>> Below is the result of a simple aggregation of "score" in ES,
> >>> {
> >>>   "key": "20170417215917",
> >>>   "doc_count": 200,
> >>>   "Stats": {
> >>>     "count": 200,
> >>>     "min": 0,
> >>>     "max": 0.019184709,
> >>>     "avg": 0.0012828724450000002,
> >>>     "sum": 0.256574489
> >>>   }
> >>> }
> >>>
> >>> Thanks,
> >>> Yongyao

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University

