Hi Semyon,

(bringing the conversation back to the user list, sorry)
> I have some questions about the scoring-depth plugin.
> Is it recursive?
> I saw that the plugin divides the overall score by the number of outgoing
> links (score /= validCount); in other words, if the initial score is 1 and
> we have 100 links, each link will have 1/100 = 0.01.
> But does it mean that the next round of crawling will work with 0.01 as a
> score and divide it by the new validCount?

Yes, of course, unless the score is also affected by another scoring plugin.
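To make the arithmetic concrete, here is a tiny standalone sketch (not the
plugin's actual code, and the outlink counts are invented):

/** Round-by-round score division as done by scoring-depth
 *  (score /= validCount); the validCount values here are made up. */
public class ScoreDivisionDemo {
  public static void main(String[] args) {
    float score = 1.0f;                // initial (injected) score
    int[] validCounts = {100, 50, 20}; // hypothetical outlinks per round
    for (int round = 1; round <= validCounts.length; round++) {
      score /= validCounts[round - 1];
      System.out.printf("round %d: each outlink gets %.6f%n", round, score);
    }
    // prints 0.010000, then 0.000200, then 0.000010
  }
}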
> The second question.
> Can I access somehow the overall number of links for a host from this plugin?

No.

> The third question.
> How can I use the result of the plugin CrawlDatum.score to prevent the
> crawling at a specific moment? In other words, how can I use it in generate
> or updatedb to stop crawling after a threshold?

That's the method ScoringFilter.generatorSortValue(...). It's implemented by
scoring-depth in combination with

<property>
  <name>generate.min.score</name>
  <value>0</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>
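For example, raising the threshold in nutch-site.xml (0.0001 is only an
illustrative value) makes the generator skip every entry whose sort value is
not larger than it; with the numbers from the sketch above, pages whose score
has dropped to 0.000010 would never be generated:

<property>
  <name>generate.min.score</name>
  <value>0.0001</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>

Note that the sort value is whatever generatorSortValue(...) returns, which
can differ from the raw CrawlDatum score if other scoring plugins are active.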
Best,
Sebastian

On 10/19/2017 05:33 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> Thank you for your answer.
>
> I have some questions about the scoring-depth plugin.
>
> The first question.
> Is it recursive?
> I saw that the plugin divides the overall score by the number of outgoing
> links (score /= validCount); in other words, if the initial score is 1 and
> we have 100 links, each link will have 1/100 = 0.01.
> But does it mean that the next round of crawling will work with 0.01 as a
> score and divide it by the new validCount?
>
> The second question.
> Can I access somehow the overall number of links for a host from this plugin?
>
> The third question.
> How can I use the result of the plugin CrawlDatum.score to prevent the
> crawling at a specific moment? In other words, how can I use it in generate
> or updatedb to stop crawling after a threshold?
>
> Semyon.
>
> *Sent:* Thursday, October 19, 2017 at 3:34 PM
> *From:* "Sebastian Nagel" <[email protected]>
> *To:* "Semyon Semyonov" <[email protected]>
> *Subject:* Re: Parsing and URL filter plugins that depend on URL pattern.
> Hi Semyon,
>
>> Should I implement an HTML parser from scratch, or can I add parsing
>> afterwards? What is the best place to do it, and how should I distinguish
>> between different URL categories?
>
> Have a look at the parse-filter plugin interface:
> http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html
> You'll get the DOM tree and, if needed, the URL via content.getUrl().
>
>> I would like to crawl each category to a different depth.
>
> There is the plugin scoring-depth, you could extend it.
>
> But a pragmatic solution could be to set up 3 crawls with similar
> configurations, only slightly different URL filters to accept only one
> category (website/A/, B or C), and a different depth for each.
>
>> There is a URL filter plugin,
>
> There are multiple URL filters; all operate only on the URL string, matched
> by suffix, prefix, regular expression, ...
>
> Best,
> Sebastian
>
> On 10/19/2017 01:51 PM, Semyon Semyonov wrote:
>> Dear all,
>>
>> I want to adjust Nutch for crawling of only one big text-based website,
>> and therefore to develop the plugins / set up the settings for the best
>> crawling performance.
>>
>> Precisely, there is a website that has 3 categories: A, B, C. The URLs are
>> therefore website/A/itemN, website/B/articleN, website/C/descriptionN.
>> For example, category A contains web-shop-like pages with price, ratings,
>> etc. B has article pages including header, text, author, and so on.
>>
>> 1) How do I write an HTML parser that produces different key-value pairs
>> for different URL patterns (different HTML patterns), e.g. NameOfItem and
>> Price for website/A/ children, Header and Text for website/B children?
>> Should I implement an HTML parser from scratch, or can I add parsing
>> afterwards? What is the best place to do it, and how should I distinguish
>> between different URL categories?
>>
>> 2) Assuming I turned off external links and crawl only internally, I
>> would like to crawl each category to a different depth. For example, I
>> want to crawl 50000 pages in category A, 10000 in B, and only 100 in C.
>> What is the best way to do this?
>> There is a URL filter plugin, but I don't know how to use it based on a
>> URL pattern or parent URL metadata.
>>
>> Thank you.
>> Semyon.
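For the parse-filter question quoted above, a filter along these lines could
branch on the URL pattern. This is only a minimal sketch: the class name and
metadata keys are invented, the DOM extraction is left out, and the usual
plugin.xml registration is still needed.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Sketch of an HtmlParseFilter that dispatches on the URL path. */
public class CategoryParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String url = content.getUrl();
    Parse parse = parseResult.get(url);
    if (parse == null) {
      return parseResult;
    }
    Metadata meta = parse.getData().getParseMeta();
    if (url.contains("/A/")) {
      // walk `doc` here to extract NameOfItem and Price, then store them:
      meta.set("category", "A");
    } else if (url.contains("/B/")) {
      // walk `doc` here to extract Header and Text
      meta.set("category", "B");
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}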

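And for the pragmatic three-crawl setup, each crawl would get its own
regex-urlfilter.txt, e.g. for category A (example.com stands in for the real
host; the filters for B and C are analogous, and the per-category depth is
then simply the number of generate/fetch/updatedb rounds you run):

# accept only category A, reject everything else
+^https?://(www\.)?example\.com/A/
-.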