Hi,

"generate.max.distance" is for Nutch 2.x, for Nutch 1.x there is the plugin 
scoring-depth:
- add it to the property "plugin.includes"
- configure the property "scoring.depth.max"
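
For example, in nutch-site.xml (a sketch: keep your existing plugin.includes
value as the base and just append scoring-depth; the list shown is only a
typical default):

  <property>
    <name>plugin.includes</name>
    <!-- your current value with scoring-depth appended, e.g.: -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>scoring.depth.max</name>
    <value>2</value>
  </property>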

But depth and cycles/rounds are equivalent if topN is large. During the first
cycle all seeds (depth 1) are fetched, the second cycle fetches all links at
depth 2, and so on. Only if there are more URLs to fetch than topN do depth
and cycles behave differently.
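
In practice that means: pass the desired depth as <Num Rounds> and drop the
"-D depth" and "-D topN" arguments, e.g. (a sketch based on the call quoted
below, paths and Solr URL taken from it):

  ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ \
      /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2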

>> Maybe I should use a lower NUTCH version (which) ?
1.13 is a good choice.

Best,
Sebastian

On 04/11/2017 03:39 PM, Ben Vachon wrote:
> Hi Fabio,
> 
> I believe there is a property generate.max.distance in nutch-site.xml in the
> newest releases that you can use to configure max depth.
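> 
> A minimal nutch-site.xml sketch (the value 2 is only an example):
> 
>   <property>
>     <name>generate.max.distance</name>
>     <value>2</value>
>   </property>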
> 
> 
> On 04/11/2017 06:20 AM, Fabio Ricci wrote:
>> Hi Sebastian
>>
>> thank you for your message. That does not help me really…
>>
>> Yes, I knew the output of ./crawl without parameters (the synopsis) - but
>> even then there are some assumptions only an insider can understand. And under
>> https://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search
>> there is an indication to use it the way I tried.
>>
>> Num Rounds is not a depth. A depth is the depth of link traversal starting
>> from the seed.
>> I admit I feel overwhelmed by all those parameters, which in my case do not
>> help me…
>>
>> I just need a tool which navigates from a seed URL down to a certain depth.
>> I do not need topN parameters …
>>
>> Maybe I should use a lower NUTCH version (which) ?
>>
>> ...
>>
>> Thanks
>> Fabio
>>
>>
>>> On 11 Apr 2017, at 10:26, Sebastian Nagel <[email protected]> 
>>> wrote:
>>>
>>> Hi Fabio,
>>>
>>> only Java/Hadoop properties can be passed via -D...
>>>
>>> Command-line parameters (such as -topN) cannot be passed to Nutch 
>>> tools/steps this way, see:
>>>
>>> % bin/crawl
>>> Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
>>>         -i|--index      Indexes crawl results into a configured indexer
>>>         -D              A Java property to pass to Nutch calls
>>>         -w|--wait       NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
>>>                         are scheduled for fetching. Suffix can be: s for second,
>>>                         m for minute, h for hour and d for day. If no suffix is
>>>                         specified second is used by default.
>>>         Seed Dir        Directory in which to look for a seeds file
>>>         Crawl Dir       Directory where the crawl/link/segments dirs are saved
>>>         Num Rounds      The number of rounds to run this crawl for
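>>>
>>> For example, to pass a real Java property together with the number of rounds
>>> (a sketch; "fetcher.server.delay" is just an arbitrary example of a property
>>> from nutch-default.xml, and the paths are placeholders):
>>>
>>>   bin/crawl -i -D fetcher.server.delay=2.0 urls/ crawl/ 2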
>>>
>>> In case of -topN: you need to modify bin/crawl (that's easy to do). There
>>> are also other ways to limit the length of the fetch list (see, e.g.,
>>> "generate.max.count").
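>>>
>>> A nutch-site.xml sketch (the value 100 is just an example; by default the
>>> count is applied per host, see "generate.count.mode" in nutch-default.xml):
>>>
>>>   <property>
>>>     <name>generate.max.count</name>
>>>     <value>100</value>
>>>   </property>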
>>>
>>> Regarding -depth: I suppose that's the same as <Num Rounds>.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 04/11/2017 01:12 AM, Fabio Ricci wrote:
>>>> Hello
>>>>
>>>> I am a newbie in NUTCH and I need a crawler in order to fetch some URLs
>>>> within a given depth and to index found pages into SOLR 6.5.
>>>>
>>>> On my OSX I got NUTCH running. I was hoping to use it directly for
>>>> indexing. Instead I am wondering why the script /runtime/local/bin/crawl
>>>> does not pass the depth and topN parameters to the software.
>>>>
>>>> Specifically, I use the following example call:
>>>>
>>>> ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D 
>>>> depth=2 -D topN=2
>>>> /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
>>>>
>>>> With one single URL inside /urls/seed.txt
>>>>
>>>> Expecting the crawling process to go to a max depth of 2.
>>>>
>>>> Instead, it runs and runs … and I suppose something runs ***differently***
>>>> than described.
>>>>
>>>> For example, I noticed the following text in the output (this is just a
>>>> segment; the output "does not stop"):
>>>>
>>>> Injecting seed URLs
>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject
>>>> /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
>>>> Injector: starting at 2017-04-11 00:54:56
>>>> Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
>>>> Injector: urlDir: /Users/fabio/NUTCH/urls
>>>> Injector: Converting injected urls to crawl db entries.
>>>> Injector: overwrite: false
>>>> Injector: update: false
>>>> Injector: Total urls rejected by filters: 0
>>>> Injector: Total urls injected after normalization and filtering: 1
>>>> Injector: Total urls injected but already in CrawlDb: 1
>>>> Injector: Total new urls injected: 0
>>>> Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
>>>> Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
>>>> Generating a new segment
>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D
>>>> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
>>>> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
>>>> mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb
>>>> /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
>>>> Generator: starting at 2017-04-11 00:54:59
>>>> Generator: Selecting best-scoring urls due for fetch.
>>>> Generator: filtering: false
>>>> Generator: normalizing: true
>>>> Generator: topN: 50000
>>>> Generator: Partitioning selected urls for politeness.
>>>> Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
>>>> Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
>>>> Operating on segment : 20170411005501
>>>> Fetching : 20170411005501
>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D 
>>>> mapreduce.job.reduces=2
>>>> -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false 
>>>> -D
>>>> mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D
>>>> fetcher.timelimit.mins=180 
>>>> /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
>>>>
>>>> Here - although I am a newbie - I notice that there is one line saying
>>>> “Generator: topN: 50000” - slightly more than -D topN=2 … and there is no
>>>> indication of the depth. So this nice script /bin/crawl seems not to pass
>>>> the -D parameters to the Java application. And maybe not even the
>>>> solr.server.url value …
>>>>
>>>> Googling for “depth” finds a lot of explanations of the deprecated form
>>>> /bin/nutch crawl -depth, … etc… so I feel a little confused and need help.
>>>>
>>>> What is wrong with my call example above please?
>>>>
>>>> Thank you for any hint which can help me understand why the -D
>>>> parameters are not passed.
>>>>
>>>> Regards
>>>> Fabio Ricci
>>>>
>>>>
>>
> 
