> It seems it is a science, not a tool :)))

... grown over time (since 2002) to cover many use cases, always with
scale (millions, billions) in mind.

> How exactly should "_maxdepth_=2" as seed metadata be (where) specified?

In the seed URLs file, separated by a tab character (\u0009):

http://example.com/	_maxdepth_=2
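For illustration, a complete seed file (the name urls/seed.txt is just an
example) with per-seed depth limits would contain one URL per line, with a
real tab (not spaces) before the metadata:

http://example.com/	_maxdepth_=2
http://example.org/	_maxdepth_=5

Note the trailing underscore in "_maxdepth_". And since scoring.depth.max
is an ordinary Nutch property, it should also be possible to set the depth
per run via the -D option of bin/crawl instead of editing nutch-site.xml -
an untested sketch, with the Solr URL and paths taken from your first mail:

./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ \
  -D scoring.depth.max=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2

Either way the scoring-depth plugin must be enabled (see the mails quoted
below).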
Best,
Sebastian

On 04/12/2017 10:13 AM, Fabio Ricci wrote:
> Dear Sebastian and Ben - thank you so far for your hints!
> It seems it is a science, not a tool :)))
> I was considering simply a graph (built upon URLs) which is built
> (= "injected" into the Nutch universe) and explored within a radius (a depth).
>
> Instead, and surely because of other considerations (mass-crawling
> aspects), there seem to be other control parameters which rather
> "approximate" this simple concept ...
>
> Anyway, in Nutch 1.13 conf/nutch-site.xml - thanks to your kind hint -
> there is a section like:
>
> <property>
>   <name>scoring.depth.max</name>
>   <value>1000</value>
>   <description>Max depth value from seed allowed by default.
>   Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
>   as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
>   to track the distance from the seed it was found from.
>   The depth is used to prioritise URLs in the generation step so that
>   shallower pages are fetched first.
>   </description>
> </property>
>
> Considering
> https://wiki.apache.org/nutch/NutchTutorial#Create_a_URL_seed_list
> now the question is:
> How exactly should "_maxdepth_=2" as seed metadata be (where) specified,
> so that the depth can be set for (or before) each Nutch run (instead of
> being changed in the properties)?
>
> Best
> Fabio
>
>
>> On 11 Apr 2017, at 22:26, Sebastian Nagel <[email protected]> wrote:
>>
>> Hi,
>>
>> "generate.max.distance" is for Nutch 2.x; for Nutch 1.x there is the
>> plugin scoring-depth:
>> - add it to the property "plugin.includes"
>> - configure the property "scoring.depth.max"
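>>
>> For example - a minimal sketch: keep whatever your installation already
>> lists in "plugin.includes" and only add scoring-depth to the expression -
>> nutch-site.xml could contain:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-(opic|depth)|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>2</value>
>> </property>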
>>
>> But depth and cycles/rounds are equivalent if topN is large. During the
>> first cycle all seeds (depth 1) are fetched, the second cycle fetches
>> all links of depth 2, and so on. Only if there are more URLs to fetch
>> than topN do you get different behavior for depth and cycles.
>>
>>>> Maybe I should use a lower Nutch version (which)?
>> 1.13 is a good choice.
>>
>> Best,
>> Sebastian
>>
>> On 04/11/2017 03:39 PM, Ben Vachon wrote:
>>> Hi Fabio,
>>>
>>> I believe there is a property generate.max.distance in nutch-site.xml
>>> in the newest releases that you can use to configure max depth.
>>>
>>>
>>> On 04/11/2017 06:20 AM, Fabio Ricci wrote:
>>>> Hi Sebastian
>>>>
>>>> thank you for your message. That does not really help me ...
>>>>
>>>> Yes, I knew the output of ./crawl without parameters (the synopsis) -
>>>> but even then there are some assumptions only an insider can
>>>> understand. And under
>>>> https://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search
>>>> there is an indication to use it like I tried.
>>>>
>>>> Num Rounds is not a depth. A depth is the depth in traversing links
>>>> starting from the seed.
>>>> I admit I feel overwhelmed by all these parameters which in my case
>>>> do not help me ...
>>>>
>>>> I just need a tool which navigates from a seed URL within a certain
>>>> depth. I do not need topN parameters ...
>>>>
>>>> Maybe I should use a lower Nutch version (which)?
>>>>
>>>> ...
>>>>
>>>> Thanks
>>>> Fabio
>>>>
>>>>
>>>>> On 11 Apr 2017, at 10:26, Sebastian Nagel <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Fabio,
>>>>>
>>>>> only Java/Hadoop properties can be passed via -D...
>>>>>
>>>>> Command-line parameters (such as -topN) cannot be passed to Nutch
>>>>> tools/steps this way, see:
>>>>>
>>>>> % bin/crawl
>>>>> Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
>>>>>   -i|--index     Indexes crawl results into a configured indexer
>>>>>   -D             A Java property to pass to Nutch calls
>>>>>   -w|--wait NUMBER[SUFFIX]  Time to wait before generating a new
>>>>>                  segment when no URLs are scheduled for fetching.
>>>>>                  Suffix can be: s for second, m for minute, h for
>>>>>                  hour and d for day. If no suffix is specified,
>>>>>                  second is used by default.
>>>>>   Seed Dir       Directory in which to look for a seeds file
>>>>>   Crawl Dir      Directory where the crawl/link/segments dirs are saved
>>>>>   Num Rounds     The number of rounds to run this crawl for
>>>>>
>>>>> In case of -topN: you need to modify bin/crawl (that's easy to do;
>>>>> see the sketch below). There are also other ways to limit the length
>>>>> of the fetch list (see, e.g., "generate.max.count").
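>>>>>
>>>>> A sketch from memory (the exact lines may differ in your copy of
>>>>> bin/crawl): near the top of the script the fetch list size is
>>>>> computed along these lines
>>>>>
>>>>> # total number of urls to fetch in one iteration, 50K per slave
>>>>> numSlaves=1
>>>>> sizeFetchlist=`expr $numSlaves \* 50000`
>>>>>
>>>>> and later handed to the generate step as "-topN $sizeFetchlist", so
>>>>> lowering sizeFetchlist gives you a smaller effective topN.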
>>>>>
>>>>> Regarding -depth: I suppose that's the same as <Num Rounds>.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 04/11/2017 01:12 AM, Fabio Ricci wrote:
>>>>>> Hello
>>>>>>
>>>>>> I am a newbie to Nutch and I need a crawler in order to fetch some
>>>>>> URLs within a given depth and to index the found pages into Solr 6.5.
>>>>>>
>>>>>> On my OS X machine I got Nutch running. I was hoping to use it
>>>>>> directly for indexing. Instead I am wondering why the script
>>>>>> runtime/local/bin/crawl does not pass the depth and topN parameters
>>>>>> to the software.
>>>>>>
>>>>>> Specifically, I use the following example call:
>>>>>>
>>>>>> ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2 -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
>>>>>>
>>>>>> with one single URL inside /urls/seed.txt,
>>>>>> expecting the crawling process to go to max depth = 2.
>>>>>>
>>>>>> Instead, it runs and runs ... and I suppose something runs
>>>>>> ***differently*** than described.
>>>>>>
>>>>>> For example, I noticed the following text in the output (this is
>>>>>> just a segment; the output "does not stop"):
>>>>>>
>>>>>> Injecting seed URLs
>>>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
>>>>>> Injector: starting at 2017-04-11 00:54:56
>>>>>> Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
>>>>>> Injector: urlDir: /Users/fabio/NUTCH/urls
>>>>>> Injector: Converting injected urls to crawl db entries.
>>>>>> Injector: overwrite: false
>>>>>> Injector: update: false
>>>>>> Injector: Total urls rejected by filters: 0
>>>>>> Injector: Total urls injected after normalization and filtering: 1
>>>>>> Injector: Total urls injected but already in CrawlDb: 1
>>>>>> Injector: Total new urls injected: 0
>>>>>> Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
>>>>>> Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
>>>>>> Generating a new segment
>>>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
>>>>>> Generator: starting at 2017-04-11 00:54:59
>>>>>> Generator: Selecting best-scoring urls due for fetch.
>>>>>> Generator: filtering: false
>>>>>> Generator: normalizing: true
>>>>>> Generator: topN: 50000
>>>>>> Generator: Partitioning selected urls for politeness.
>>>>>> Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
>>>>>> Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
>>>>>> Operating on segment : 20170411005501
>>>>>> Fetching : 20170411005501
>>>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
>>>>>>
>>>>>> Here - although I am a newbie - I notice that there is one line
>>>>>> saying "Generator: topN: 50000" - slightly more than -D topN=2 ...
>>>>>> and there is no indication of the depth. So this nice script
>>>>>> bin/crawl seems not to pass the -D parameters to the Java
>>>>>> application. And maybe not even the solr.server.url value ...
>>>>>>
>>>>>> Googling for "depth" finds a lot of explanations of the deprecated
>>>>>> form "bin/nutch crawl -depth ..." etc., so I feel a little confused
>>>>>> and need help.
>>>>>>
>>>>>> What is wrong with my call example above, please?
>>>>>>
>>>>>> Thank you for any hint which can help me understand why the -D
>>>>>> parameters are not passed.
>>>>>>
>>>>>> Regards
>>>>>> Fabio Ricci
>>>>>>
>>>>>>
>>>>
>>>
>>
>
>

