Hi Sebastian, thank you for your message. Unfortunately that does not really help me…

Yes, I knew the output of ./crawl without parameters (the synopsis) - but even there some assumptions are made which only an insider can understand. And https://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search suggests a usage like the one I tried.

Num Rounds is not a depth. A depth is the depth reached when traversing links starting from the seed. I admit I feel overwhelmed by all these parameters, which in my case do not help me… I just need a tool which follows links from a seed URL down to a certain depth; I do not need any topN parameters …
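If your hint is right and <Num Rounds> plays the role of the old -depth, then I suppose a depth-2 crawl of my single seed would simply be the following call (same paths as in my original mail below, just 2 rounds instead of 1 and without the -D depth / -D topN properties - please correct me if I misunderstood):

  # assuming 2 rounds ~ depth 2; paths and Solr URL as in my original mail
  ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2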
Maybe I should use an older NUTCH version (which one)? ...

Thanks
Fabio
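PS: you also mention "generate.max.count" as another way to limit the length of the fetch list. If that is an ordinary Nutch/Hadoop property, then - if I now understand the -D mechanism correctly - I could presumably pass it through bin/crawl as sketched below, instead of my invalid "-D topN=2". I am only guessing that this property limits how many URLs per host/domain end up in each fetch list, and the value 2 is just an example - please correct me if the property or the value makes no sense here:

  # generate.max.count is the property you mentioned; 2 is only an example value
  ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D generate.max.count=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 2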
> On 11 Apr 2017, at 10:26, Sebastian Nagel <[email protected]> wrote:
>
> Hi Fabio,
>
> only Java/Hadoop properties can be passed via -D...
>
> Command-line parameters (such as -topN) cannot be passed to Nutch tools/steps
> this way, see:
>
> % bin/crawl
> Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
>   -i|--index    Indexes crawl results into a configured indexer
>   -D            A Java property to pass to Nutch calls
>   -w|--wait     NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
>                 are scheduled for fetching. Suffix can be: s for second,
>                 m for minute, h for hour and d for day. If no suffix is
>                 specified second is used by default.
>   Seed Dir      Directory in which to look for a seeds file
>   Crawl Dir     Directory where the crawl/link/segments dirs are saved
>   Num Rounds    The number of rounds to run this crawl for
>
> In case of -topN : you need to modify bin/crawl (that's easy to do). There
> are also other ways to limit the length of the fetch list (see, e.g.,
> "generate.max.count").
>
> Regarding -depth : I suppose that's the same as <Num Rounds>
>
> Best,
> Sebastian
>
> On 04/11/2017 01:12 AM, Fabio Ricci wrote:
>> Hello
>>
>> I am a newbie in NUTCH and I need a crawler in order to fetch some URLs
>> within a given depth and to index the found pages into SOLR 6.5.
>>
>> On my OSX I got NUTCH running. I was hoping to use it directly for indexing.
>> Instead I am wondering why the script /runtime/local/bin/crawl does not pass
>> the depth and topN parameters to the software.
>>
>> Specifically, I use the following example call:
>>
>> ./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2 -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1
>>
>> with one single URL inside /urls/seed.txt,
>>
>> expecting the crawling process to go down to max depth = 2.
>>
>> Instead, it runs and runs … and I suppose something runs ***differently***
>> than described.
>>
>> For example I noticed the following text in the output (this is just an
>> excerpt, the output "does not stop"):
>>
>> Injecting seed URLs
>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
>> Injector: starting at 2017-04-11 00:54:56
>> Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
>> Injector: urlDir: /Users/fabio/NUTCH/urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: overwrite: false
>> Injector: update: false
>> Injector: Total urls rejected by filters: 0
>> Injector: Total urls injected after normalization and filtering: 1
>> Injector: Total urls injected but already in CrawlDb: 1
>> Injector: Total new urls injected: 0
>> Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
>> Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
>> Generating a new segment
>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
>> Generator: starting at 2017-04-11 00:54:59
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: false
>> Generator: normalizing: true
>> Generator: topN: 50000
>> Generator: Partitioning selected urls for politeness.
>> Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
>> Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
>> Operating on segment : 20170411005501
>> Fetching : 20170411005501
>> /Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50
>>
>> Here - although I am a newbie - I notice that there is one line saying
>> “Generator: topN: 50000” - slightly more than -D topN=2 … and there is no
>> indication of the depth. So this nice script /bin/crawl seems not to pass
>> the -D parameters to the Java application. And maybe not even the
>> solr.server.url value …
>>
>> Googling for “depth” finds a lot of explanations of the deprecated form
>> /bin/nutch crawl -depth, … etc… so I feel a little confused and need help.
>>
>> What is wrong with my call example above, please?
>>
>> Thank you for any hint which can help me understand why the -D parameters
>> are not passed.
>>
>> Regards
>> Fabio Ricci

