Hi, good point.
Maybe we should implement a limit on the usage of boilerpipe:
- either by MIME type (only HTML types)
  I doubt that boilerpipe has been implemented for any formats except HTML
- or by document size (or size of the DOM tree)

Please open a Jira issue to implement this.

But you may also ask on the Tika user mailing list about the problem first.

Best,
Sebastian

On 3/18/19 11:49 AM, hany.n...@hsbc.com.INVALID wrote:
> Hi,
>
> I found the root cause and it is not related to JVM Heap Size.
>
> The problem of parsing these pdfs happen when I enable the tika extractor to
> be boilerpipe.
>
> Boilerpipe article extractor is working perfectly with other pdfs and pages;
> when I disable it, Tika is able to parse and index these pdfs.
>
> Any suggestion/help?
>
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __________________________________________________________________
>
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __________________________________________________________________
> Protect our environment - please only print this if you have to!
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
>
> Hi,
>
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
>
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #                   Default is 1000.
> #
> #   NUTCH_OPTS      Extra Java runtime options.
> #                   Multiple options must be separated by white space.
>
> In distributed mode, please read the Hadoop docs about mapper/reducer memory
> and Java heap space.
>
> Best,
> Sebastian
>
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
>> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
>>
>> Is it correct?, should I change anywhere else?
>>
>> Kind regards,
>> Hany Shehata
>>
>> -----Original Message-----
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: 14 March 2019 10:59
>> To: user@nutch.apache.org
>> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have
>> no choice, either skip large files, or increase memory.
>>
>> Regards,
>> Markus
>>
>> -----Original message-----
>>> From:hany.n...@hsbc.com.INVALID <hany.n...@hsbc.com.INVALID>
>>> Sent: Thursday 14th March 2019 10:44
>>> To: user@nutch.apache.org
>>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Hello,
>>>
>>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while
>>> trying to parse pdfs that includes 3500 pages.
>>>
>>> I increased the JVM RAM to 1500MB; however, I'm still facing the same
>>> problem
>>>
>>> Please advise....
>>>
>>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing
>>> http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf
>>> with org.apache.nutch.parse.tika.TikaParser
>>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>>>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>>>     at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>>>     at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>     at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
>>>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>>>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
>>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
>>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>
>>> Kind regards,
>>> Hany Shehata
>>>
>>> -----------------------------------------
>>> SAVE PAPER - THINK BEFORE YOU PRINT!
>>>
>>> This E-mail is confidential.
>>>
>>> It may also be legally privileged. If you are not the addressee you
>>> may not copy, forward, disclose or use any part of it. If you have
>>> received this message in error, please delete it and all copies from
>>> your system and notify the sender immediately by return E-mail.
>>>
>>> Internet communications cannot be guaranteed to be timely secure, error or
>>> virus-free.
>>> The sender does not accept liability for any errors or omissions.
>>>
>>
>>
>> ***************************************************
>> This message originated from the Internet. Its originator may or may not be
>> who they claim to be and the information contained in the message and any
>> attachments may or may not be accurate.
>> ****************************************************
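
For illustration only, the limit proposed at the top of this thread (use boilerpipe only for HTML MIME types and only below a size threshold) might look roughly like the sketch below. This is not existing Nutch or Tika code: the class name BoilerpipeGuard, the method shouldUseBoilerpipe, the list of HTML types and the 5 MB threshold are all assumptions made up for the example, and a real patch would presumably read such limits from the Nutch configuration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch or Tika. */
final class BoilerpipeGuard {

    // Assumed set of MIME types for which boilerpipe extraction makes sense.
    private static final Set<String> HTML_TYPES = new HashSet<>(
            Arrays.asList("text/html", "application/xhtml+xml"));

    // Assumed size limit in bytes (5 MB); purely illustrative.
    static final long MAX_BYTES = 5L * 1024 * 1024;

    private BoilerpipeGuard() {
    }

    /**
     * Returns true only if the document is an HTML type and its size
     * is known and does not exceed MAX_BYTES.
     */
    static boolean shouldUseBoilerpipe(String mimeType, long contentLength) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return HTML_TYPES.contains(base)
                && contentLength >= 0
                && contentLength <= MAX_BYTES;
    }
}
```

With such a check, a large PDF like the one in the stack trace would be parsed by Tika without the boilerpipe step, while ordinary HTML pages would still go through it.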