Hi,

Good point.

Maybe we should limit when boilerpipe is applied:
- either by MIME type (HTML types only);
  I doubt that boilerpipe has been implemented for any format other than HTML
- or by document size (or size of the DOM tree)
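
A minimal sketch of what such a check might look like. The class name, the
method, and the byte limit are hypothetical and not existing Nutch or Tika
API; the set of HTML MIME types is also an assumption:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch or Tika. */
final class BoilerpipeGuard {

    // Assumption: boilerpipe is only meaningful for these HTML-like types.
    private static final Set<String> HTML_TYPES = new HashSet<>(
        Arrays.asList("text/html", "application/xhtml+xml"));

    private BoilerpipeGuard() {}

    /**
     * Returns true only if the document is an HTML type and its size
     * does not exceed maxBytes.
     */
    static boolean shouldUseBoilerpipe(String mimeType, long contentLength,
                                       long maxBytes) {
        if (mimeType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
            .trim().toLowerCase(Locale.ROOT);
        return HTML_TYPES.contains(base) && contentLength <= maxBytes;
    }
}
```

With a check like this, a PDF would be parsed without boilerpipe regardless
of its size. Where the check is called and how the limit is configured would
still need to be decided in the issue.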

Please open a Jira issue to implement this.

But you may also ask on the Tika user mailing list about the problem first.

Best,
Sebastian


On 3/18/19 11:49 AM, hany.n...@hsbc.com.INVALID wrote:
> Hi,
> 
> I found the root cause, and it is not related to the JVM heap size.
>
> The problem parsing these PDFs occurs when I set the Tika extractor to
> boilerpipe.
>
> The boilerpipe article extractor works perfectly with other PDFs and pages;
> when I disable it, Tika is able to parse and index these PDFs.
>
> Any suggestions or help?
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __________________________________________________________________ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __________________________________________________________________ 
> Protect our environment - please only print this if you have to!
> 
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> If running in local mode, it's better to pass the heap size to bin/nutch via
> environment variables, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #                   Default is 1000.
> #
> #   NUTCH_OPTS      Extra Java runtime options.
> #                   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
>> I'm setting mapred.child.java.opts=-Xmx1500m in the crawl bash script.
>>
>> Is that correct? Should I change it anywhere else?
>>
>>
>> Kind regards,
>> Hany Shehata
>>
>>
>> -----Original Message-----
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: 14 March 2019 10:59
>> To: user@nutch.apache.org
>> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have
>> no choice: either skip large files or increase memory.
>>
>> Regards,
>> Markus
>>
>>  
>>  
>> -----Original message-----
>>> From:hany.n...@hsbc.com.INVALID <hany.n...@hsbc.com.INVALID>
>>> Sent: Thursday 14th March 2019 10:44
>>> To: user@nutch.apache.org
>>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Hello,
>>>
>>> I'm getting an "OutOfMemoryError: GC overhead limit exceeded" exception
>>> while trying to parse PDFs that contain 3500 pages.
>>>
>>> I increased the JVM heap to 1500 MB; however, I'm still facing the same
>>> problem.
>>>
>>> Please advise.
>>>
>>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf with org.apache.nutch.parse.tika.TikaParser
>>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>         at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>>>         at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>>>         at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>         at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>         at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
>>>         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>>>         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>>         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>>         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
>>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
>>>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>>>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>>>
>>> Kind regards,
>>> Hany Shehata
>>>
>>>
>>>
>>> -----------------------------------------
>>> SAVE PAPER - THINK BEFORE YOU PRINT!
>>>
>>> This E-mail is confidential.  
>>>
>>> It may also be legally privileged. If you are not the addressee you 
>>> may not copy, forward, disclose or use any part of it. If you have 
>>> received this message in error, please delete it and all copies from 
>>> your system and notify the sender immediately by return E-mail.
>>>
>>> Internet communications cannot be guaranteed to be timely secure, error or 
>>> virus-free.
>>> The sender does not accept liability for any errors or omissions.
>>>
>>
>>
>> ***************************************************
>> This message originated from the Internet. Its originator may or may not be 
>> who they claim to be and the information contained in the message and any 
>> attachments may or may not be accurate.
>> ****************************************************
> 
