Hi Semyon,

(bringing the conversation back to the user list, sorry)
> I have some questions about the scoring-depth plugin.
> Is it recursive?
> I saw that the plugin divides the overall score by the number of outgoing
> links (score /= validCount); in other words, if the initial score is 1 and
> we have 100 links, each link will have 1/100 = 0.01.
> But does it mean that the next round of crawling will work with 0.01 as a
> score and divide it by the new validCount?

Yes, of course, unless the score is also affected by another scoring plugin.
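To make the arithmetic concrete, here is a tiny standalone sketch (not the
plugin's actual code, and the outlink counts are invented):

/** Round-by-round score division as done by scoring-depth
 *  (score /= validCount); the validCount values here are made up. */
public class ScoreDivisionDemo {
  public static void main(String[] args) {
    float score = 1.0f;                // initial (injected) score
    int[] validCounts = {100, 50, 20}; // hypothetical outlinks per round
    for (int round = 1; round <= validCounts.length; round++) {
      score /= validCounts[round - 1];
      System.out.printf("round %d: each outlink gets %.6f%n", round, score);
    }
    // prints 0.010000, then 0.000200, then 0.000010
  }
}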
> The second question.
> Can I access somehow the overall number of links for a host from this plugin?

No.

> The third question.
> How can I use the result of the plugin CrawlDatum.score to prevent the
> crawling at a specific moment? In other words, how can I use it in generate
> or updatedb to stop crawling after a threshold?

That's the method ScoringFilter.generatorSortValue(...). It's implemented by
scoring-depth in combination with

<property>
  <name>generate.min.score</name>
  <value>0</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>
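For example, raising the threshold in nutch-site.xml (0.0001 is only an
illustrative value) makes the generator skip every entry whose sort value is
not larger than it; with the numbers from the sketch above, pages whose score
has dropped to 0.000010 would never be generated:

<property>
  <name>generate.min.score</name>
  <value>0.0001</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>

Note that the sort value is whatever generatorSortValue(...) returns, which
can differ from the raw CrawlDatum score if other scoring plugins are active.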
Best,
Sebastian

On 10/19/2017 05:33 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> Thank you for your answer.
>
> I have some questions about the scoring-depth plugin.
>
> The first question.
> Is it recursive?
> I saw that the plugin divides the overall score by the number of outgoing
> links (score /= validCount); in other words, if the initial score is 1 and
> we have 100 links, each link will have 1/100 = 0.01.
> But does it mean that the next round of crawling will work with 0.01 as a
> score and divide it by the new validCount?
>
> The second question.
> Can I access somehow the overall number of links for a host from this plugin?
>
> The third question.
> How can I use the result of the plugin CrawlDatum.score to prevent the
> crawling at a specific moment? In other words, how can I use it in generate
> or updatedb to stop crawling after a threshold?
>
> Semyon.
>
> *Sent:* Thursday, October 19, 2017 at 3:34 PM
> *From:* "Sebastian Nagel" <[email protected]>
> *To:* "Semyon Semyonov" <[email protected]>
> *Subject:* Re: Parsing and URL filter plugins that depend on URL pattern.
> Hi Semyon,
>
>> Should I implement an HTML parser from scratch, or can I add parsing
>> afterwards? What is the best place to do it, and how should I distinguish
>> between different URL categories?
>
> Have a look at the parse-filter plugin interface:
> http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html
> You'll get the DOM tree and, if needed, the URL via content.getUrl().
>
>> I would like to crawl each category to a different depth.
>
> There is the plugin scoring-depth, you could extend it.
>
> But a pragmatic solution could be to set up 3 crawls with similar
> configurations, only slightly different URL filters to accept only one
> category (website/A/, B or C), and a different depth for each.
>
>> There is a URL filter plugin,
>
> There are multiple URL filters; all operate only on the URL string, matched
> by suffix, prefix, regular expression, ...
>
> Best,
> Sebastian
>
> On 10/19/2017 01:51 PM, Semyon Semyonov wrote:
>> Dear all,
>>
>> I want to adjust Nutch for crawling of only one big text-based website,
>> and therefore to develop the plugins / set up the settings for the best
>> crawling performance.
>>
>> Precisely, there is a website that has 3 categories: A, B, C. The URLs are
>> therefore website/A/itemN, website/B/articleN, website/C/descriptionN.
>> For example, category A contains web-shop-like pages with price, ratings,
>> etc. B has article pages including header, text, author, and so on.
>>
>> 1) How do I write an HTML parser that produces different key-value pairs
>> for different URL patterns (different HTML patterns), e.g. NameOfItem and
>> Price for website/A/ children, Header and Text for website/B children?
>> Should I implement an HTML parser from scratch, or can I add parsing
>> afterwards? What is the best place to do it, and how should I distinguish
>> between different URL categories?
>>
>> 2) Assuming I turned off external links and crawl only internally, I
>> would like to crawl each category to a different depth. For example, I
>> want to crawl 50000 pages in category A, 10000 in B, and only 100 in C.
>> What is the best way to do this?
>> There is a URL filter plugin, but I don't know how to use it based on a
>> URL pattern or parent URL metadata.
>>
>> Thank you.
>> Semyon.
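For the parse-filter question quoted above, a filter along these lines could
branch on the URL pattern. This is only a minimal sketch: the class name and
metadata keys are invented, the DOM extraction is left out, and the usual
plugin.xml registration is still needed.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Sketch of an HtmlParseFilter that dispatches on the URL path. */
public class CategoryParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String url = content.getUrl();
    Parse parse = parseResult.get(url);
    if (parse == null) {
      return parseResult;
    }
    Metadata meta = parse.getData().getParseMeta();
    if (url.contains("/A/")) {
      // walk `doc` here to extract NameOfItem and Price, then store them:
      meta.set("category", "A");
    } else if (url.contains("/B/")) {
      // walk `doc` here to extract Header and Text
      meta.set("category", "B");
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}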

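And for the pragmatic three-crawl setup, each crawl would get its own
regex-urlfilter.txt, e.g. for category A (example.com stands in for the real
host; the filters for B and C are analogous, and the per-category depth is
then simply the number of generate/fetch/updatedb rounds you run):

# accept only category A, reject everything else
+^https?://(www\.)?example\.com/A/
-.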