Hello,

I'm getting an "OutOfMemoryError: GC overhead limit exceeded" exception while
trying to parse PDFs that contain 3500 pages.

I increased the JVM memory to 1500 MB; however, I'm still facing the same problem.
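
One way to confirm that the larger memory setting is actually picked up by the process doing the parsing is to print the maximum heap the JVM itself reports. The following is only a minimal sketch: the class name HeapCheck and its helper methods are hypothetical (they are not part of Nutch or Tika), and it assumes the class is started with the same JVM options as the parse job.

```java
// Hypothetical helper, not part of Nutch or Tika: reports the heap limit
// that the running JVM actually sees.
public class HeapCheck {

    // Convert a byte count to whole mebibytes.
    static long bytesToMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Maximum heap the JVM will attempt to use, in MB.
    // Depending on the garbage collector, this can be somewhat lower
    // than the value passed with -Xmx.
    static long maxHeapMb() {
        return bytesToMb(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("Max heap reported by the JVM: " + maxHeapMb() + " MB");
    }
}
```

If the printed value is well below 1500 MB, the setting is probably not reaching the JVM that runs the parser.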

Please advise.

2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf with org.apache.nutch.parse.tika.TikaParser
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
                at java.util.concurrent.FutureTask.report(FutureTask.java:122)
                at java.util.concurrent.FutureTask.get(FutureTask.java:206)
                at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
                at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
                at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
                at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
                at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
                at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
                at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
                at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
                at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
                at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
                at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
                at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
                at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
                at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
                at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
                at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
                at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
                at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland


