Hi Bayu,
the short answer is: no.
The detailed answer:
NUTCH_OPTS is used to pass arguments to the Java VM.
This also includes Java system properties.
You can define Nutch/Hadoop configuration properties using variables which
are then substituted by Java system properties, e.g.
<property>
  <name>my.prop</name>
  <value>${my.prop}</value>
</property>
Setting
NUTCH_OPTS="-Xmx2048m -Dmy.prop=XYZ"
lets you pass the desired value XYZ via the command line or NUTCH_OPTS.
See
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html
for details about variable substitution.
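As a concrete sketch (the property name my.index.dir and its value are made up
for illustration, not actual Nutch properties):

```shell
# Hypothetical property declared in conf/nutch-site.xml:
#   <property>
#     <name>my.index.dir</name>
#     <value>${my.index.dir}</value>
#   </property>
#
# Supply the actual value as a JVM system property; Hadoop's Configuration
# substitutes ${my.index.dir} with the system property of the same name.
export NUTCH_OPTS="-Xmx2048m -Dmy.index.dir=/data/nutch/index"
bin/nutch index ...
```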
One remark: You may have seen commands like
bin/nutch org.apache.nutch.indexer.IndexingJob -Dsolr.server.url="..." ...
That's not a JVM system property, because the argument "-D..." comes after
the class to be run. Most (if not all) Nutch tools/commands use ToolRunner.run(),
which supports generic options (among them -Dproperty=value).
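To make the difference in argument position explicit (commands shown as a
sketch; the Solr URL is a placeholder):

```shell
# JVM system property: -D comes BEFORE the class, via NUTCH_OPTS,
# so it is consumed by the Java VM itself.
NUTCH_OPTS="-Dsolr.server.url=http://localhost:8983/solr" \
  bin/nutch org.apache.nutch.indexer.IndexingJob ...

# ToolRunner generic option: -D comes AFTER the class, so it is parsed
# by Hadoop's GenericOptionsParser and set on the job Configuration.
bin/nutch org.apache.nutch.indexer.IndexingJob \
  -Dsolr.server.url=http://localhost:8983/solr ...
```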
Sebastian
On 11/01/2013 12:54 AM, Bayu Widyasanyata wrote:
> Hi,
>
> One more question for NUTCH_OPTS.
> Is it only for Java additional options or we could pass any nutch options
> e.g. topN, depth, etc.?
>
> I couldn't find more documentation on the crawl script beyond what is
> mentioned here [1].
>
> [1] http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
>
>
> On Thu, Oct 31, 2013 at 8:43 PM, Bayu Widyasanyata
> <[email protected]>wrote:
>
>> Hi Sebastian,
>>
>> Thanks for the hint.
>>
>> ---
>> wassalam,
>> [bayu]
>>
>> /sent from Android phone/
>> On Oct 30, 2013 7:54 PM, "Sebastian Nagel" <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> the script bin/crawl executes bin/nutch for every step (inject, fetch,
>>> etc.).
>>>
>>> bin/nutch makes use of two environment variables (see comments in
>>> bin/nutch
>>> ):
>>> NUTCH_HEAPSIZE (in MB)
>>> NUTCH_OPTS Extra Java runtime options
>>>
>>> export NUTCH_HEAPSIZE=2048
>>> should work but also
>>> export NUTCH_OPTS="-Xmx2048m"
>>>
>>> The latter also lets you add further Java options, separated by spaces.
>>>
>>> Sebastian
>>>
>>>
>>> 2013/10/30 Bayu Widyasanyata <[email protected]>
>>>
>>>> Hi All,
>>>>
>>>> When I ran the crawl script [1] (not nutch's crawl), I got a Java OOM
>>>> heap space error:
>>>>
>>>> 2013-10-29 12:56:25,407 WARN mapred.LocalJobRunner - job_local1484958909_0001
>>>> java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>     at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:344)
>>>>     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:406)
>>>>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:238)
>>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:348)
>>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:368)
>>>>     at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
>>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:517)
>>>>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:399)
>>>>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1698)
>>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1328)
>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> 2013-10-29 12:56:25,787 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
>>>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
>>>>
>>>> I use nutch 1.7 and JDK 1.7.0.45.
>>>>
>>>> How do I set the Java max heap size (-Xmx option) for the crawl script?
>>>>
>>>> Thanks in advance.-
>>>>
>>>> [1]
>>>> http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
>>>>
>>>> --
>>>> wassalam,
>>>> [bayu]
>>>>
>>>
>>
>
>