Then how do we set Nutch options like topN, depth, etc. in the crawl script? Thanks.-
On Fri, Nov 1, 2013 at 9:05 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Bayu,
>
> the short answer is: no.
>
> The detailed answer:
>
> NUTCH_OPTS is used to pass arguments to the Java VM.
> This also includes Java system properties.
> You could define Nutch/Hadoop configuration properties using variables
> which will then be substituted by Java system properties, e.g.
>   <property>
>     <name>my.prop</name>
>     <value>${my.prop}</value>
>   </property>
> Setting
>   NUTCH_OPTS="-Xmx2048m -Dmy.prop=XYZ"
> would allow you to pass the desired value XYZ via the command line or NUTCH_OPTS.
> See
> https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html
> for details about variable substitution.
>
> One remark: you may have seen commands like
>   bin/nutch org.apache.nutch.indexer.IndexingJob -Dsolr.server.url="..." ...
> That's not a system property, because the argument "-D..." comes after
> the class to be run. Most (if not all) Nutch tools/commands use
> ToolRunner.run(), which supports generic options (among them -Dproperty=value).
>
> Sebastian
>
> On 11/01/2013 12:54 AM, Bayu Widyasanyata wrote:
> > Hi,
> >
> > One more question about NUTCH_OPTS.
> > Is it only for additional Java options, or could we pass any Nutch options,
> > e.g. topN, depth, etc.?
> >
> > I couldn't find any tutorial on the crawl script other than the one
> > mentioned here [1].
> >
> > [1] http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
> >
> > On Thu, Oct 31, 2013 at 8:43 PM, Bayu Widyasanyata <[email protected]> wrote:
> >
> >> Hi Sebastian,
> >>
> >> Thanks for the hint.
> >>
> >> ---
> >> wassalam,
> >> [bayu]
> >>
> >> /sent from Android phone/
> >> On Oct 30, 2013 7:54 PM, "Sebastian Nagel" <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> the script bin/crawl executes bin/nutch for every step (inject, fetch,
> >>> etc.).
> >>>
> >>> bin/nutch makes use of two environment variables (see the comments in
> >>> bin/nutch):
> >>>   NUTCH_HEAPSIZE  (in MB)
> >>>   NUTCH_OPTS      extra Java runtime options
> >>>
> >>>   export NUTCH_HEAPSIZE=2048
> >>> should work, but so does
> >>>   export NUTCH_OPTS="-Xmx2048m"
> >>>
> >>> The latter also allows you to add more Java options, separated by
> >>> spaces.
> >>>
> >>> Sebastian
> >>>
> >>> 2013/10/30 Bayu Widyasanyata <[email protected]>
> >>>
> >>>> Hi All,
> >>>>
> >>>> When I ran the crawl script [1] (not nutch's crawl), I got a Java OOM
> >>>> heap space error:
> >>>>
> >>>> 2013-10-29 12:56:25,407 WARN mapred.LocalJobRunner - job_local1484958909_0001
> >>>> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
> >>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> >>>> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>>>     at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:344)
> >>>>     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:406)
> >>>>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:238)
> >>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:348)
> >>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:368)
> >>>>     at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
> >>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:517)
> >>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:399)
> >>>>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
> >>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1698)
> >>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1328)
> >>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
> >>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> >>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
> >>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>>     at java.lang.Thread.run(Thread.java:744)
> >>>>
> >>>> 2013-10-29 12:56:25,787 ERROR fetcher.Fetcher - Fetcher:
> >>>> java.io.IOException: Job failed!
> >>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> >>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
> >>>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
> >>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
> >>>>
> >>>> I use Nutch 1.7 and JDK 1.7.0_45.
> >>>>
> >>>> How do I set the Java max heap size (the -Xmx option) in the crawl script?
> >>>>
> >>>> Thanks in advance.-
> >>>>
> >>>> [1] http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
> >>>>
> >>>> --
> >>>> wassalam,
> >>>> [bayu]

--
wassalam,
[bayu]
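To make Sebastian's NUTCH_OPTS explanation concrete, here is a minimal sketch of the two ways a value can be passed, following his my.prop example (the property name and the Solr URL are illustrative placeholders, not real configuration):

```shell
# 1) JVM system property, set BEFORE the class name via NUTCH_OPTS.
#    With a property declared in conf/nutch-site.xml as
#      <property>
#        <name>my.prop</name>
#        <value>${my.prop}</value>
#      </property>
#    Hadoop's Configuration substitutes ${my.prop} from the system property:
export NUTCH_OPTS="-Xmx2048m -Dmy.prop=XYZ"

# 2) Hadoop generic option, given AFTER the class name and parsed by
#    ToolRunner.run() -- this is NOT a JVM system property (hypothetical URL):
# bin/nutch org.apache.nutch.indexer.IndexingJob -Dsolr.server.url="http://localhost:8983/solr" ...

echo "$NUTCH_OPTS"
```

The key distinction is position: `-D` before the class goes to the JVM, `-D` after the class goes to Hadoop's GenericOptionsParser.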
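As for the original out-of-memory question: since bin/crawl runs bin/nutch for every step, exporting either variable before starting the script raises the heap for the whole crawl. A minimal sketch, where the seed directory, crawl directory, Solr URL, and round count are hypothetical placeholders:

```shell
# bin/nutch reads both of these (see the comments in bin/nutch).
# Either one is sufficient:
export NUTCH_HEAPSIZE=2048      # max heap in MB, i.e. -Xmx2048m
export NUTCH_OPTS="-Xmx2048m"   # same effect, with room for more space-separated options

# Hypothetical invocation of the crawl script:
# bin/crawl urls/ mycrawl/ http://localhost:8983/solr/ 2

echo "$NUTCH_HEAPSIZE"
```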

