Hello,

I'm getting an OutOfMemoryError: GC overhead limit exceeded exception while trying to parse PDFs that contain 3500 pages.
I increased the JVM memory to 1500 MB; however, I'm still facing the same problem. Please advise.

2019-03-08 05:31:55,269 WARN parse.ParseUtil - Error parsing http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf with org.apache.nutch.parse.tika.TikaParser
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:206)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
    at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
    at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)

Kind regards,
Hany Shehata

