Hi Fabio,
only Java/Hadoop properties can be passed via -D.
Command-line parameters (such as -topN) cannot be passed to the Nutch tools/steps
this way; see the usage below and the small example after it:
% bin/crawl
Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
  -i|--index    Indexes crawl results into a configured indexer
  -D            A Java property to pass to Nutch calls
  -w|--wait     NUMBER[SUFFIX] Time to wait before generating a new segment
                when no URLs are scheduled for fetching. Suffix can be:
                s for second, m for minute, h for hour and d for day.
                If no suffix is specified second is used by default.
  Seed Dir      Directory in which to look for a seeds file
  Crawl Dir     Directory where the crawl/link/segments dirs are saved
  Num Rounds    The number of rounds to run this crawl for
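A hypothetical contrast to illustrate this (placeholder arguments, not literal commands):

# works: solr.server.url is a real Nutch property, so the indexing step picks it up
bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ <Seed Dir> <Crawl Dir> <Num Rounds>

# accepted but useless: "topN" and "depth" are passed along as Java properties
# that no Nutch tool reads, so they change nothing in the generate step
bin/crawl -i -D topN=2 -D depth=2 <Seed Dir> <Crawl Dir> <Num Rounds>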
In the case of -topN: you need to modify bin/crawl (that's easy to do). There are
also other ways to limit the length of the fetch list, e.g. via the property
"generate.max.count".
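A rough sketch of both options (the edit in option 1 is an assumption: the exact
line differs between Nutch versions, so look for where the value that shows up as
"-topN 50000" in your generate call is set):

# Option 1: edit bin/crawl and lower the hard-coded generator limit,
# e.g. change the 50000 to whatever small value you want.

# Option 2: cap the URLs per host/domain in each fetch list via a property
# (the value 100 is only an example):
bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ \
    -D generate.max.count=100 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1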
Regarding -depth: I suppose that's the same as <Num Rounds>.
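So for your example, dropping the unsupported -D depth / -D topN pairs and passing
2 as <Num Rounds> should give you the depth-2 crawl you want:

./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ \
    /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2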
Best,
Sebastian
On 04/11/2017 01:12 AM, Fabio Ricci wrote:
> Hello
>
> I am a newbie with Nutch and I need a crawler in order to fetch some URLs
> within a given depth and to index the found pages into Solr 6.5.
>
> On my OS X machine I got Nutch running. I was hoping to use it directly for indexing.
> Instead I am wondering why the script /runtime/local/bin/crawl does not pass
> the depth and topN parameters to the software.
>
> Specifically, I use the following example call:
>
> ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2
> -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
>
> With one single URL inside /urls/seed.txt
>
> I was expecting the crawling process to go to a maximum depth of 2.
>
> Instead, it runs and runs … and I suppose something runs ***differently*** than
> described.
>
> For example, I noticed the following text in the output (this is just an
> excerpt; the output "does not stop"):
>
> Injecting seed URLs
> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject
> /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
> Injector: starting at 2017-04-11 00:54:56
> Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
> Injector: urlDir: /Users/fabio/NUTCH/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: overwrite: false
> Injector: update: false
> Injector: Total urls rejected by filters: 0
> Injector: Total urls injected after normalization and filtering: 1
> Injector: Total urls injected but already in CrawlDb: 1
> Injector: Total new urls injected: 0
> Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
> Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
> Generating a new segment
> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb
> /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
> Generator: starting at 2017-04-11 00:54:59
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
> Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
> Operating on segment : 20170411005501
> Fetching : 20170411005501
> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
>
> Here - although I am a newbie - I notice that there is one line saying
> “Generator: topN: 50000” - slightly more than -D topN=2 … and there is no
> indication of the depth. So this nice script /bin/crawl seems not to pass
> the -D parameters to the Java application. And maybe not even the
> solr.server.url value …
>
> Googling for “depth” finds a lot of explanations of the deprecated form
> /bin/nutch crawl -depth, etc., so I feel a little confused and need help.
>
> What is wrong with my example call above, please?
>
> Thank you for any hint that can help me understand why the -D parameters
> are not passed.
>
> Regards
> Fabio Ricci
>
>