Hi Sebastian, thank you for your message. Unfortunately that does not really help me…

Yes, I knew the output of ./crawl without parameters (the synopsis) - but even there some assumptions are made which only an insider can understand. And https://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search suggests a usage like the one I tried.

Num Rounds is not a depth. A depth is the depth reached when traversing links starting from the seed. I admit I feel overwhelmed by all these parameters, which in my case do not help me… I just need a tool which follows links from a seed URL down to a certain depth; I do not need any topN parameters …
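If your hint is right and <Num Rounds> plays the role of the old -depth, then I suppose a depth-2 crawl of my single seed would simply be the following call (same paths as in my original mail below, just 2 rounds instead of 1 and without the -D depth / -D topN properties - please correct me if I misunderstood):

  # assuming 2 rounds ~ depth 2; paths and Solr URL as in my original mail
  ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2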
Maybe I should use an older NUTCH version (which one)? ...

Thanks
Fabio
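PS: you also mention "generate.max.count" as another way to limit the length of the fetch list. If that is an ordinary Nutch/Hadoop property, then - if I now understand the -D mechanism correctly - I could presumably pass it through bin/crawl as sketched below, instead of my invalid "-D topN=2". I am only guessing that this property limits how many URLs per host/domain end up in each fetch list, and the value 2 is just an example - please correct me if the property or the value makes no sense here:

  # generate.max.count is the property you mentioned; 2 is only an example value
  ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D generate.max.count=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2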
> On 11 Apr 2017, at 10:26, Sebastian Nagel <[email protected]> wrote:
>
> Hi Fabio,
>
> only Java/Hadoop properties can be passed via -D...
>
> Command-line parameters (such as -topN) cannot be passed to Nutch tools/steps
> this way, see:
>
> % bin/crawl
> Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
>   -i|--index    Indexes crawl results into a configured indexer
>   -D            A Java property to pass to Nutch calls
>   -w|--wait     NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
>                 are scheduled for fetching. Suffix can be: s for second,
>                 m for minute, h for hour and d for day. If no suffix is
>                 specified second is used by default.
>   Seed Dir      Directory in which to look for a seeds file
>   Crawl Dir     Directory where the crawl/link/segments dirs are saved
>   Num Rounds    The number of rounds to run this crawl for
>
> In case of -topN : you need to modify bin/crawl (that's easy to do). There
> are also other ways to limit the length of the fetch list (see, e.g.,
> "generate.max.count").
>
> Regarding -depth : I suppose that's the same as <Num Rounds>
>
> Best,
> Sebastian
>
> On 04/11/2017 01:12 AM, Fabio Ricci wrote:
>> Hello
>>
>> I am a newbie in NUTCH and I need a crawler in order to fetch some URLs
>> within a given depth and to index the found pages into SOLR 6.5.
>>
>> On my OSX I got NUTCH running. I was hoping to use it directly for indexing.
>> Instead I am wondering why the script /runtime/local/bin/crawl does not pass
>> the depth and topN parameters to the software.
>>
>> Specifically, I use the following example call:
>>
>> ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2 -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
>>
>> with one single URL inside /urls/seed.txt,
>>
>> expecting the crawling process to go down to max depth = 2.
>>
>> Instead, it runs and runs … and I suppose something runs ***differently***
>> than described.
>>
>> For example I noticed the following text in the output (this is just an
>> excerpt, the output "does not stop"):
>>
>> Injecting seed URLs
>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
>> Injector: starting at 2017-04-11 00:54:56
>> Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
>> Injector: urlDir: /Users/fabio/NUTCH/urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: overwrite: false
>> Injector: update: false
>> Injector: Total urls rejected by filters: 0
>> Injector: Total urls injected after normalization and filtering: 1
>> Injector: Total urls injected but already in CrawlDb: 1
>> Injector: Total new urls injected: 0
>> Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
>> Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
>> Generating a new segment
>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
>> Generator: starting at 2017-04-11 00:54:59
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: false
>> Generator: normalizing: true
>> Generator: topN: 50000
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
>> Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
>> Operating on segment : 20170411005501
>> Fetching : 20170411005501
>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
>>
>> Here - although I am a newbie - I notice that there is one line saying
>> “Generator: topN: 50000” - slightly more than -D topN=2 … and there is no
>> indication of the depth. So this nice script /bin/crawl seems not to pass
>> the -D parameters to the Java application. And maybe not even the
>> solr.server.url value …
>>
>> Googling for “depth” finds a lot of explanations of the deprecated form
>> /bin/nutch crawl -depth, … etc… so I feel a little confused and need help.
>>
>> What is wrong with my call example above, please?
>>
>> Thank you for any hint which can help me understand why the -D parameters
>> are not passed.
>>
>> Regards
>> Fabio Ricci

