Hi, I found the root cause and it is not related to JVM Heap Size.
The problem parsing these PDFs happens when I set the Tika extractor to boilerpipe. The Boilerpipe article extractor works perfectly with other PDFs and pages; when I disable it, Tika is able to parse and index these PDFs.

Any suggestions/help?

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__________________________________________________________________

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__________________________________________________________________
Protect our environment - please only print this if you have to!


-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
Sent: 14 March 2019 13:06
To: user@nutch.apache.org
Subject: Re: OutOfMemoryError: GC overhead limit exceeded

Hi,

if running in local mode, the heap size is better passed via ENV to bin/nutch, cf.

# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#                   Default is 1000.
#
#   NUTCH_OPTS      Extra Java runtime options.
#                   Multiple options must be separated by white space.

In distributed mode, please read the Hadoop docs about mapper/reducer memory and Java heap space.

Best,
Sebastian

On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> I'm changing mapred.child.java.opts=-Xmx1500m in the crawl bash file.
>
> Is that correct? Should I change it anywhere else?
>
> Kind regards,
> Hany Shehata
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 14 March 2019 10:59
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
>
> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no
> choice: either skip large files or increase memory.
>
> Regards,
> Markus
>
>
> -----Original message-----
>> From: hany.n...@hsbc.com.INVALID <hany.n...@hsbc.com.INVALID>
>> Sent: Thursday 14th March 2019 10:44
>> To: user@nutch.apache.org
>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello,
>>
>> I'm facing an "OutOfMemoryError: GC overhead limit exceeded" exception while
>> trying to parse PDFs that include 3500 pages.
>>
>> I increased the JVM RAM to 1500 MB; however, I'm still facing the same
>> problem.
>>
>> Please advise.
>>
>> 2019-03-08 05:31:55,269 WARN parse.ParseUtil - Error parsing http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf with org.apache.nutch.parse.tika.TikaParser
>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>         at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>>         at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>>         at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>         at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
>>         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>>         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>
>> Kind regards,
>> Hany Shehata
>>
>>
>> -----------------------------------------
>> SAVE PAPER - THINK BEFORE YOU PRINT!
>>
>> This E-mail is confidential.
>>
>> It may also be legally privileged. If you are not the addressee you
>> may not copy, forward, disclose or use any part of it. If you have
>> received this message in error, please delete it and all copies from
>> your system and notify the sender immediately by return E-mail.
>>
>> Internet communications cannot be guaranteed to be timely secure, error or
>> virus-free.
>> The sender does not accept liability for any errors or omissions.
>
>
> ***************************************************
> This message originated from the Internet. Its originator may or may not be
> who they claim to be and the information contained in the message and any
> attachments may or may not be accurate.
> ****************************************************
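
For anyone following Sebastian's advice for local mode, the environment variables from the bin/nutch header could be set roughly as below. This is only a sketch under assumptions: the 1500 MB value is the one mentioned in the thread, the extra JVM flag in NUTCH_OPTS is a standard HotSpot option chosen here for illustration, and the segment path in the final comment is hypothetical. Note that the latest message in the thread reports the failure was tied to the boilerpipe extractor rather than heap size, so raising the heap may not help with these particular PDFs.

```shell
# Maximum heap for bin/nutch in local mode, in MB (bin/nutch default is 1000).
export NUTCH_HEAPSIZE=1500

# Extra JVM options, separated by white space.
# Assumption: this HotSpot flag is only an illustrative choice; it writes a
# heap dump when an OutOfMemoryError occurs, which can help with diagnosis.
export NUTCH_OPTS="-XX:+HeapDumpOnOutOfMemoryError"

# Show what bin/nutch would pick up from the environment.
echo "NUTCH_HEAPSIZE=${NUTCH_HEAPSIZE} NUTCH_OPTS=${NUTCH_OPTS}"

# Then run the parse step from the Nutch runtime/local directory, e.g.
# (hypothetical segment path):
#   bin/nutch parse crawl/segments/20190308053155
```

In distributed mode these variables do not control the task JVMs; as Sebastian says, the Hadoop mapper/reducer memory settings apply there instead.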