Hi,

If you are running in local mode, it's better to pass the heap size to bin/nutch via environment variables, cf. the comments in that script:

# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#                   Default is 1000.
#
#   NUTCH_OPTS      Extra Java runtime options.
#                   Multiple options must be separated by white space.
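
For local mode, a minimal sketch of how these variables might be set before calling bin/nutch. The 4000 MB heap, the heap-dump option, and the segment path are illustrative assumptions, not values from this thread. The command is only printed, so the snippet can be tried without a Nutch installation.

```shell
#!/bin/sh
# Assumed values for illustration only; size the heap for your machine and documents.
NUTCH_HEAPSIZE=4000                              # maximum heap in MB (bin/nutch default: 1000)
NUTCH_OPTS="-XX:+HeapDumpOnOutOfMemoryError"     # extra JVM options, separated by white space
export NUTCH_HEAPSIZE NUTCH_OPTS

# Hypothetical segment path; replace it with a real segment directory.
SEGMENT="crawl/segments/20190308053155"

# Build the parse command and print it together with the settings.
# To really run it, execute $NUTCH_CMD from the Nutch runtime directory.
NUTCH_CMD="bin/nutch parse $SEGMENT"
echo "NUTCH_HEAPSIZE=$NUTCH_HEAPSIZE NUTCH_OPTS=$NUTCH_OPTS $NUTCH_CMD"
```

Setting the variables in the calling shell keeps the bin/nutch and bin/crawl scripts unmodified, so the setting survives an upgrade.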

In distributed mode, please read the Hadoop docs about mapper/reducer memory and
Java heap space.
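
As a rough sketch for distributed mode: Hadoop sizes map tasks with the properties mapreduce.map.memory.mb (container size) and mapreduce.map.java.opts (JVM options for the task). The 4096 MB container and the 80% heap ratio below are assumptions chosen for illustration, not recommendations from this thread. Check the Hadoop docs for your version and cluster limits.

```shell
#!/bin/sh
# Assumed container size for one map task, in MB (illustrative only).
MAP_MEMORY_MB=4096

# Assumption: leave about 20% of the container for non-heap JVM memory.
MAP_HEAP_MB=$((MAP_MEMORY_MB * 80 / 100))

# Generic Hadoop -D options that could be added to a job's command line.
HADOOP_MEM_OPTS="-D mapreduce.map.memory.mb=${MAP_MEMORY_MB} -D mapreduce.map.java.opts=-Xmx${MAP_HEAP_MB}m"

echo "$HADOOP_MEM_OPTS"
```

The heap (-Xmx) has to stay below the container size, otherwise the cluster may kill the task for exceeding its memory limit. The reducer has the analogous properties mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts.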

Best,
Sebastian

On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> I'm setting mapred.child.java.opts=-Xmx1500m in the crawl bash script.
> 
> Is that correct, or should I change it anywhere else?
> 
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __________________________________________________________________ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __________________________________________________________________ 
> Protect our environment - please only print this if you have to!
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: 14 March 2019 10:59
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
> choice: either skip large files or increase memory.
> 
> Regards,
> Markus
> 
>  
>  
> -----Original message-----
>> From:hany.n...@hsbc.com.INVALID <hany.n...@hsbc.com.INVALID>
>> Sent: Thursday 14th March 2019 10:44
>> To: user@nutch.apache.org
>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello,
>>
>> I'm getting an "OutOfMemoryError: GC overhead limit exceeded" exception 
>> while trying to parse PDFs that include 3500 pages.
>>
>> I increased the JVM memory to 1500 MB; however, I'm still facing the same 
>> problem.
>>
>> Please advise.
>>
>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf with org.apache.nutch.parse.tika.TikaParser
>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>     at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>>     at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>>     at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>     at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
>>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>
>> Kind regards,
>> Hany Shehata
> 
> 