> It seems it is a science, not a tool :)))

... grown over time (since 2002) to cover many use cases, always with scale 
(millions, billions of pages) in mind.

> How exactly, and where, should "_maxdepth_=2" be specified as seed metadata

In the seed URLs file, separated by a tab character (\u0009):

http://example.com/     _maxdepth_=2
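
For example, a hypothetical seeds.txt with two seeds crawled to different
depths (the whitespace before _maxdepth_ must be a literal tab):

http://example.com/     _maxdepth_=2
http://example.org/     _maxdepth_=5

Since scoring.depth.max is an ordinary Nutch/Hadoop property, it should also
be possible to override it per run via -D instead of editing nutch-site.xml,
for example:

bin/crawl -i -D scoring.depth.max=2 <Seed Dir> <Crawl Dir> 2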

Best,
Sebastian


On 04/12/2017 10:13 AM, Fabio Ricci wrote:
> Dear Sebastian and Ben - thank you so far for your hints!
> It seems it is a science, not a tool :)))
> I was simply thinking of a graph (built upon URLs) which is built 
> (i.e. "injected" into the Nutch universe) and explored within a radius (a depth).
> 
> Instead, surely because of other considerations (mass-crawling aspects), 
> there seem to be other control parameters which rather "approximate" this 
> simple concept … 
> 
> Anyway, in Nutch 1.13 conf/nutch-site.xml - thanks to your kind hint - there 
> is a section like:
> 
> <property>
>   <name>scoring.depth.max</name>
>   <value>1000</value>
>   <description>Max depth value from seed allowed by default.
>   Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
>   as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
>   to track the distance from the seed it was found from. 
>   The depth is used to prioritise URLs in the generation step so that
>   shallower pages are fetched first.
>   </description>
> </property>
> 
> Considering 
> https://wiki.apache.org/nutch/NutchTutorial#Create_a_URL_seed_list 
> the question now is: 
> How exactly, and where, should "_maxdepth_=2" be specified as seed metadata, 
> so that the depth can be set for (or before) each Nutch run (instead of 
> being changed in the properties)?
> 
> Best
> Fabio
> 
> 
> 
>> On 11 Apr 2017, at 22:26, Sebastian Nagel <[email protected]> wrote:
>>
>> Hi,
>>
>> "generate.max.distance" is for Nutch 2.x, for Nutch 1.x there is the plugin 
>> scoring-depth:
>> - add it to the property "plugin.includes" (see the sketch below)
>> - configure the property "scoring.depth.max"
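>>
>> For example, a minimal nutch-site.xml sketch (the plugin list is the Nutch
>> 1.x default plus scoring-depth - adapt it to the plugins you actually use):
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>> <property>
>>   <name>scoring.depth.max</name>
>>   <value>2</value>
>> </property>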
>>
>> But depth and cycles/rounds are equivalent if topN is large. During the
>> first cycle all seeds (depth 1) are fetched, the second cycle fetches all
>> links of depth 2, and so on. Only if there are more URLs to fetch than topN
>> do you get a different behavior for depth and cycles (e.g., a single seed
>> linking to 100 new pages yields exactly the depth-2 frontier in round 2,
>> as long as 100 <= topN).
>>
>>>> Maybe I should use an older Nutch version (which one)?
>> 1.13 is a good choice.
>>
>> Best,
>> Sebastian
>>
>> On 04/11/2017 03:39 PM, Ben Vachon wrote:
>>> Hi Fabio,
>>>
>>> I believe there is a property generate.max.distance in nutch-site.xml in 
>>> the newest releases that
>>> you can use to configure max depth.
>>>
>>>
>>> On 04/11/2017 06:20 AM, Fabio Ricci wrote:
>>>> Hi Sebastian
>>>>
>>>> thank you for your message. That does not really help me …
>>>>
>>>> Yes, I knew the output of ./crawl without parameters (the synopsis) - but 
>>>> even then there are some assumptions only an insider can understand. And under
>>>> https://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search
>>>> there is an indication to use it like I tried.
>>>>
>>>> Num Rounds is not a depth. A depth is the depth of traversing links 
>>>> starting from the seed.
>>>> I admit I feel overwhelmed by all these parameters, which in my case do not 
>>>> help me …
>>>>
>>>> I just need a tool which navigates from a seed URL within a certain depth. 
>>>> I do not need topN parameters …
>>>>
>>>> Maybe I should use an older Nutch version (which one)?
>>>>
>>>> ...
>>>>
>>>> Thanks
>>>> Fabio
>>>>
>>>>
>>>>> On 11 Apr 2017, at 10:26, Sebastian Nagel <[email protected]> 
>>>>> wrote:
>>>>>
>>>>> Hi Fabio,
>>>>>
>>>>> only Java/Hadoop properties can be passed via -D...
>>>>>
>>>>> Command-line parameters (such as -topN) cannot be passed to Nutch 
>>>>> tools/steps this way, see:
>>>>>
>>>>> % bin/crawl
>>>>> Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
>>>>>        -i|--index      Indexes crawl results into a configured indexer
>>>>>        -D              A Java property to pass to Nutch calls
>>>>>        -w|--wait       NUMBER[SUFFIX] Time to wait before generating a new
>>>>>                        segment when no URLs are scheduled for fetching.
>>>>>                        Suffix can be: s for second, m for minute, h for hour
>>>>>                        and d for day. If no suffix is specified second is
>>>>>                        used by default.
>>>>>        Seed Dir        Directory in which to look for a seeds file
>>>>>        Crawl Dir       Directory where the crawl/link/segments dirs are saved
>>>>>        Num Rounds      The number of rounds to run this crawl for
>>>>>
>>>>> In case of -topN : you need to modify bin/crawl (that's easy to do). 
>>>>> There are also other ways to
>>>>> limit the length of the fetch list (see, e.g., "generate.max.count").
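>>>>>
>>>>> For example, a nutch-site.xml sketch (these properties exist in the Nutch
>>>>> 1.x defaults; the values here are only an assumption for illustration):
>>>>>
>>>>> <property>
>>>>>   <name>generate.count.mode</name>
>>>>>   <!-- count URLs per "host" or per "domain" -->
>>>>>   <value>host</value>
>>>>> </property>
>>>>> <property>
>>>>>   <name>generate.max.count</name>
>>>>>   <!-- at most 100 URLs per host in each generated fetch list -->
>>>>>   <value>100</value>
>>>>> </property>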
>>>>>
>>>>> Regarding -depth : I suppose that's the same as <Num Rounds>
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 04/11/2017 01:12 AM, Fabio Ricci wrote:
>>>>>> Hello
>>>>>>
>>>>>> I am a newbie with Nutch and I need a crawler in order to fetch some URLs 
>>>>>> within a given depth and to index the found pages into Solr 6.5.
>>>>>>
>>>>>> On my OS X machine I got Nutch running. I was hoping to use it directly 
>>>>>> for indexing. Instead I am wondering why the script runtime/local/bin/crawl 
>>>>>> does not pass the depth and topN parameters to the software.
>>>>>>
>>>>>> In particular, I use the following example call:
>>>>>>
>>>>>> ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2 -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
>>>>>>
>>>>>> With one single URL inside urls/seed.txt.
>>>>>>
>>>>>> I expected the crawling process to go to max depth = 2.
>>>>>>
>>>>>> Instead, it runs and runs … and I suppose something runs 
>>>>>> ***differently*** than described.
>>>>>>
>>>>>> For example, I noticed the following text in the output (this is just a 
>>>>>> segment; the output "does not stop"):
>>>>>>
>>>>>> Injecting seed URLs
>>>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject
>>>>>> /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
>>>>>> Injector: starting at 2017-04-11 00:54:56
>>>>>> Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
>>>>>> Injector: urlDir: /Users/fabio/NUTCH/urls
>>>>>> Injector: Converting injected urls to crawl db entries.
>>>>>> Injector: overwrite: false
>>>>>> Injector: update: false
>>>>>> Injector: Total urls rejected by filters: 0
>>>>>> Injector: Total urls injected after normalization and filtering: 1
>>>>>> Injector: Total urls injected but already in CrawlDb: 1
>>>>>> Injector: Total new urls injected: 0
>>>>>> Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
>>>>>> Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
>>>>>> Generating a new segment
>>>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate 
>>>>>> -D
>>>>>> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
>>>>>> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
>>>>>> mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb
>>>>>> /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
>>>>>> Generator: starting at 2017-04-11 00:54:59
>>>>>> Generator: Selecting best-scoring urls due for fetch.
>>>>>> Generator: filtering: false
>>>>>> Generator: normalizing: true
>>>>>> Generator: topN: 50000
>>>>>> Generator: Partitioning selected urls for politeness.
>>>>>> Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
>>>>>> Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
>>>>>> Operating on segment : 20170411005501
>>>>>> Fetching : 20170411005501
>>>>>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D 
>>>>>> mapreduce.job.reduces=2
>>>>>> -D mapred.child.java.opts=-Xmx1000m -D 
>>>>>> mapreduce.reduce.speculative=false -D
>>>>>> mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D
>>>>>> fetcher.timelimit.mins=180 
>>>>>> /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
>>>>>>
>>>>>> Here - although I am a newbie - I notice that there is one line saying 
>>>>>> "Generator: topN: 50000" - slightly more than -D topN=2 … and there is 
>>>>>> no indication of the depth. So this nice script bin/crawl seems not to 
>>>>>> pass the -D parameters to the Java application. And maybe not even the 
>>>>>> solr.server.url value …
>>>>>>
>>>>>> Googling for "depth" finds a lot of explanations of the deprecated form 
>>>>>> bin/nutch crawl -depth, etc., so I feel a little confused and need help.
>>>>>>
>>>>>> What is wrong with my call example above, please?
>>>>>>
>>>>>> Thank you for any hint which can help me understand why the -D 
>>>>>> parameters are not passed.
>>>>>>
>>>>>> Regards
>>>>>> Fabio Ricci
>>>>>>
>>>>>>
>>>>
>>>
>>
> 
> 
