Hi,

the arguments are:

  bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

- depth is numberOfRounds
- topN is fixed (50000, see the variable sizeFetchlist); you have to modify
  bin/crawl to your needs. That's explicitly intended: bin/crawl is more of
  an example and by far not an all-purpose tool.
- etc.: it depends. If an option is set via command-line arguments to the
  bin/nutch commands, modify bin/crawl. Otherwise you have to set the
  corresponding properties in nutch-site.xml.
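For example, a minimal sketch (the seed directory, crawl directory, and
Solr URL are placeholders, adapt them to your setup):

  # two rounds of generate/fetch/parse/updatedb, then index into Solr
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2

  # to change topN, edit bin/crawl and adjust the value assigned to
  # sizeFetchlist (the exact line may differ between Nutch versions):
  sizeFetchlist=10000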
Sebastian

On 11/03/2013 02:44 PM, Bayu Widyasanyata wrote:
> Then, how do we set the Nutch options topN, depth, etc. in the crawl script?
>
> Thanks.-
>
>
> On Fri, Nov 1, 2013 at 9:05 PM, Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Bayu,
>>
>> the short answer is: no.
>>
>> The detailed answer:
>>
>> NUTCH_OPTS is used to pass arguments to the Java VM.
>> This also includes Java system properties.
>> You could define Nutch/Hadoop configuration properties using variables,
>> which will then be substituted by Java system properties, e.g.
>>
>>   <property>
>>     <name>my.prop</name>
>>     <value>${my.prop}</value>
>>   </property>
>>
>> Setting
>>
>>   NUTCH_OPTS="-Xmx2048m -Dmy.prop=XYZ"
>>
>> would allow you to pass the desired value XYZ via the command line or
>> NUTCH_OPTS. See
>> https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html
>> for details about variable substitution.
>>
>> One remark: you may have seen commands like
>>
>>   bin/nutch org.apache.nutch.indexer.IndexingJob -Dsolr.server.url="..." ...
>>
>> That's not a system property, because the argument "-D..." comes after
>> the class to be run. Most (if not all) Nutch tools/commands use
>> ToolRunner.run(), which supports generic options (among them
>> -Dproperty=value).
>>
>> Sebastian
>>
>> On 11/01/2013 12:54 AM, Bayu Widyasanyata wrote:
>>> Hi,
>>>
>>> One more question about NUTCH_OPTS.
>>> Is it only for additional Java options, or can we also pass Nutch
>>> options, e.g. topN, depth, etc.?
>>>
>>> I couldn't find any tutorial on the crawl script beyond the one
>>> mentioned here [1].
>>>
>>> [1] http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
>>>
>>>
>>> On Thu, Oct 31, 2013 at 8:43 PM, Bayu Widyasanyata
>>> <[email protected]> wrote:
>>>
>>>> Hi Sebastian,
>>>>
>>>> Thanks for the hint.
>>>>
>>>> ---
>>>> wassalam,
>>>> [bayu]
>>>>
>>>> /sent from Android phone/
>>>> On Oct 30, 2013 7:54 PM, "Sebastian Nagel" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> the script bin/crawl executes bin/nutch for every step (inject,
>>>>> fetch, etc.).
>>>>>
>>>>> bin/nutch makes use of two environment variables (see the comments
>>>>> in bin/nutch):
>>>>>
>>>>>   NUTCH_HEAPSIZE  (in MB)
>>>>>   NUTCH_OPTS      Extra Java runtime options
>>>>>
>>>>>   export NUTCH_HEAPSIZE=2048
>>>>>
>>>>> should work, but so does
>>>>>
>>>>>   export NUTCH_OPTS="-Xmx2048m"
>>>>>
>>>>> The latter also allows adding more Java options, separated by spaces.
>>>>>
>>>>> Sebastian
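Tying the two quoted explanations together: a minimal, untested sketch that
raises the heap and passes a value into a substituted property in one go
(my.prop is the placeholder property from the example above; the bin/crawl
arguments are illustrative):

  # -Xmx2048m raises the JVM heap; -Dmy.prop=XYZ becomes a Java system
  # property, which Hadoop substitutes into ${my.prop} in nutch-site.xml
  export NUTCH_OPTS="-Xmx2048m -Dmy.prop=XYZ"
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2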
>>>>>
>>>>> 2013/10/30 Bayu Widyasanyata <[email protected]>
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> When I ran the crawl script [1] (not nutch's crawl), I got a Java
>>>>>> OOM (heap space) error:
>>>>>>
>>>>>> 2013-10-29 12:56:25,407 WARN  mapred.LocalJobRunner - job_local1484958909_0001
>>>>>> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
>>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>>   at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:344)
>>>>>>   at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:406)
>>>>>>   at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:238)
>>>>>>   at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:348)
>>>>>>   at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:368)
>>>>>>   at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
>>>>>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:517)
>>>>>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:399)
>>>>>>   at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>>>>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1698)
>>>>>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1328)
>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>>>>>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>
>>>>>> 2013-10-29 12:56:25,787 ERROR fetcher.Fetcher - Fetcher:
>>>>>> java.io.IOException: Job failed!
>>>>>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>>   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
>>>>>>   at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>   at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
>>>>>>
>>>>>> I use Nutch 1.7 and JDK 1.7.0_45.
>>>>>>
>>>>>> How do I set the Java max heap size (the -Xmx option) for the crawl
>>>>>> script?
>>>>>>
>>>>>> Thanks in advance.-
>>>>>>
>>>>>> [1] http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
>>>>>>
>>>>>> --
>>>>>> wassalam,
>>>>>> [bayu]
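To answer the original heap question directly: exporting NUTCH_HEAPSIZE
before running the script should be enough, because bin/crawl invokes
bin/nutch for every step and bin/nutch reads that variable. A minimal
sketch (2048 MB and the bin/crawl arguments are example values):

  # heap size, in MB, applied to every bin/nutch invocation
  export NUTCH_HEAPSIZE=2048
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2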

