Hello, I am a newbie with Nutch and I need a crawler in order to fetch some URLs up to a given depth and to index the found pages into Solr 6.5.
On my OS X machine I have Nutch running. I was hoping to use it directly for indexing. Instead I am wondering why the script `/runtime/local/bin/crawl` does not pass the depth and topN parameters on to the software. Specifically, I use the following example call:

```
./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2 -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
```

with one single URL inside `urls/seed.txt`, expecting the crawling process to go to a maximum depth of 2. Instead, it runs and runs … and I suppose something runs ***differently*** than described. For example, I noticed the following text in the output (this is just a segment, the output "does not stop"):

```
Injecting seed URLs
/Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
Injector: starting at 2017-04-11 00:54:56
Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
Injector: urlDir: /Users/fabio/NUTCH/urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
Generating a new segment
/Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2017-04-11 00:54:59
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
Operating on segment : 20170411005501
Fetching : 20170411005501
/Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
```

Here, although I am a newbie, I notice there is one line saying "Generator: topN: 50000", which is slightly more than `-D topN=2` … and there is no indication of the depth at all. So this nice script `/bin/crawl` seems not to pass the `-D` parameters to the Java application, and maybe not even the `solr.server.url` value …
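Peeking into the crawl script itself seems to support that suspicion. The excerpt below is paraphrased from my local copy of `runtime/local/bin/crawl` (a Nutch 1.x version), so the exact lines and variable names may differ in other versions, but topN looks like it is computed inside the script rather than read from a `-D` property:

```bash
# Approximate excerpt from runtime/local/bin/crawl (Nutch 1.x);
# exact lines may differ between versions.

# local mode: one "slave" node
numSlaves=1

# total number of reduce tasks -> shows up as -D mapreduce.job.reduces=2
numTasks=`expr $numSlaves \* 2`

# number of URLs to fetch in one iteration -> shows up as -topN 50000
sizeFetchlist=`expr $numSlaves \* 50000`

# later the generate step is launched with this computed value,
# not with anything taken from my -D topN=2 option, roughly:
#   $bin/nutch generate ... -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
```

That would match the `-topN 50000 -numFetchers 1 -noFilter` and `mapreduce.job.reduces=2` I see in the output above.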
Googling for "depth" finds a lot of explanations of the deprecated form `bin/nutch crawl -depth …`, etc., so I feel a little confused and need help. What is wrong with my example call above, please? Thank you for any hint that can help me understand why the `-D` parameters are not passed.

Regards, Fabio Ricci
