RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi,

The JIRA ticket has been created: https://issues.apache.org/jira/browse/NUTCH-2703

I'm able to crawl the website and these huge PDFs with a 500MB JVM heap without 
Boilerpipe.

Enabling Boilerpipe forced me to increase the JVM heap to 8500MB.
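
For reference, a rough sketch of the two local-mode runs being compared (I am 
assuming the parse job accepts the usual Hadoop -D options; the segment path and 
exact numbers are only illustrative):

    # segment path and values below are illustrative
    NUTCH_HEAPSIZE=500  bin/nutch parse -Dtika.extractor=none crawl/segments/20190318120000        # parses fine
    NUTCH_HEAPSIZE=8500 bin/nutch parse -Dtika.extractor=boilerpipe crawl/segments/20190318120000   # needed with Boilerpipe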

I hope this bug can be fixed in Nutch 1.16.

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: hany.n...@hsbc.com.INVALID [mailto:hany.n...@hsbc.com.INVALID] 
Sent: 18 March 2019 12:21
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello Markus,

I am able to parse these PDFs without increasing the heap if the Tika extractor is 
set to none.

I did increase the heap with Boilerpipe enabled, but it didn't work: it gave me 
"failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully 
parse content", and then the OOM.

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __ 

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: 18 March 2019 12:12
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello Hany,

If you deal with large PDF files and you get an OOM with this stack trace, it 
is highly unlikely to be caused by Boilerpipe being active. Boilerpipe does not 
run until PDFBox is finished, so you should really increase the heap.

Of course, to answer the question, Boilerpipe should not run for non-(X)HTML 
pages anyway, so you can open a ticket. But the resources saved by such a 
change would be minimal at best.

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Monday 18th March 2019 11:49
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul.
> Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> > I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> > 
> > Is it correct?, should I change anywhere else?
> > 
> > 
> > Kind regards,
> > Hany Shehata
> > Enterprise Engineer
> > Green Six Sigma Certified
> > Solutions Architect, Marketing and 

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hello Markus,

I am able to parse these PDFs without increasing the heap if the Tika extractor is 
set to none.

I did increase the heap with Boilerpipe enabled, but it didn't work: it gave me 
"failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully 
parse content", and then the OOM.

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: 18 March 2019 12:12
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello Hany,

If you deal with large PDF files and you get an OOM with this stack trace, it 
is highly unlikely to be caused by Boilerpipe being active. Boilerpipe does not 
run until PDFBox is finished, so you should really increase the heap.

Of course, to answer the question, Boilerpipe should not run for non-(X)HTML 
pages anyway, so you can open a ticket. But the resources saved by such a 
change would be minimal at best.

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Monday 18th March 2019 11:49
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> > I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> > 
> > Is it correct?, should I change anywhere else?
> > 
> > 
> > Kind regards,
> > Hany Shehata
> > Enterprise Engineer
> > Green Six Sigma Certified
> > Solutions Architect, Marketing and Communications IT Corporate 
> > Functions | HSBC Operations, Services and Technology (HOST) ul.
> > Kapelanka 42A, 30-347 Kraków, Poland 
> > __
> > 
> > Tie line: 7148 7689 4698
> > External: +48 123 42 0698
> > Mobile: +48 723 680 278
> > E-mail: hany.n...@hsbc.com
> > __________
> > Protect our environment - please only print this if you have to!
> > 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 14 March 2019 10:59
> > To: user@nutch.apache.org
> > Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> > 
> > Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
> > no choice, either skip large files, or increase memory.
> > 
> > Regards,
> > Markus
> > 
> >  
>

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Markus Jelsma
Hello Hany,

If you deal with large PDF files and you get an OOM with this stack trace, it 
is highly unlikely to be caused by Boilerpipe being active. Boilerpipe does not 
run until PDFBox is finished, so you should really increase the heap.

Of course, to answer the question, Boilerpipe should not run for non-(X)HTML 
pages anyway, so you can open a ticket. But the resources saved by such a 
change would be minimal at best.

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Monday 18th March 2019 11:49
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __ 
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> > I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> > 
> > Is it correct?, should I change anywhere else?
> > 
> > 
> > Kind regards,
> > Hany Shehata
> > Enterprise Engineer
> > Green Six Sigma Certified
> > Solutions Architect, Marketing and Communications IT Corporate 
> > Functions | HSBC Operations, Services and Technology (HOST) ul. 
> > Kapelanka 42A, 30-347 Kraków, Poland 
> > __
> > 
> > Tie line: 7148 7689 4698
> > External: +48 123 42 0698
> > Mobile: +48 723 680 278
> > E-mail: hany.n...@hsbc.com
> > ______________
> > Protect our environment - please only print this if you have to!
> > 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 14 March 2019 10:59
> > To: user@nutch.apache.org
> > Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> > 
> > Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
> > no choice, either skip large files, or increase memory.
> > 
> > Regards,
> > Markus
> > 
> >  
> >  
> > -Original message-
> >> From:hany.n...@hsbc.com.INVALID 
> >> Sent: Thursday 14th March 2019 10:44
> >> To: user@nutch.apache.org
> >> Subject: OutOfMemoryError: GC overhead limit exceeded
> >>
> >> Hello,
> >>
> >> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> >> trying to parse pdfs that includes 3500 pages.
> >>
> >> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
> >> problem
> >>
> >> Please advise
> >>
> >> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
> >> http://domain/-/media/files/attachments/common/voting_disclosure_2014
> >> _ q2.pdf with org.apache.nutch.parse.tika.TikaParser
> >> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
> >> overhead limit exceeded
> >> at 
> >> 

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi,

Is there any workaround for now to exclude PDFs from Boilerpipe processing?


Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 18 March 2019 12:01
To: user@nutch.apache.org
Subject: Re: OutOfMemoryError: GC overhead limit exceeded

Hi,

good point.

Maybe we should implement a limit on the usage of boilerpipe:
- either by MIME type (only HTML types)
  I doubt that boilerpipe has been implemented for any formats except HTML
- or by document size (or size of the DOM tree)

Please open a Jira issue to implement this.

But you may also ask on the Tika user mailing list about the problem first.

Best,
Sebastian


On 3/18/19 11:49 AM, hany.n...@hsbc.com.INVALID wrote:
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
>> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
>>
>> Is it correct?, should I change anywhere else?
>>
>>
>> Kind regards,
>> Hany Shehata
>> Enterprise Engineer
>> Green Six Sigma Certified
>> Solutions Architect, Marketing and Communications IT Corporate 
>> Functions | HSBC Operations, Services and Technology (HOST) ul.
>> Kapelanka 42A, 30-347 Kraków, Poland 
>> __
>>
>> Tie line: 7148 7689 4698
>> External: +48 123 42 0698
>> Mobile: +48 723 680 278
>> E-mail: hany.n...@hsbc.com
>> __________
>> Protect our environment - please only print this if you have to!
>>
>>
>> -Original Message-
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: 14 March 2019 10:59
>> To: user@nutch.apache.org
>> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
>> no choice, either skip large files, or increase memory.
>>
>> Regards,
>> Markus
>>
>>  
>>  
>> -Original message-
>>> From:hany.n...@hsbc.com.INVALID 
>>> Sent: Thursday 14th March 2019 10:44
>>> To: user@nutch.apache.org
>>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Hello,
>>>
>>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
>>> trying to parse pdfs that includes 3500 pages.
>>>
>>

Re: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Sebastian Nagel
Hi,

good point.

Maybe we should implement a limit on the usage of boilerpipe:
- either by MIME type (only HTML types)
  I doubt that boilerpipe has been implemented for any formats except HTML
- or by document size (or size of the DOM tree)

Please open a Jira issue to implement this.

But you may also ask on the Tika user mailing list about the problem first.
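
As an interim workaround you could also simply switch the extractor off for the 
affected crawl - a minimal sketch, assuming the parse job accepts Hadoop's -D 
options and using an illustrative segment path:

    # tika.extractor=none disables Boilerpipe for this run; segment path illustrative
    bin/nutch parse -Dtika.extractor=none crawl/segments/20190318110000

or set tika.extractor to "none" in conf/nutch-site.xml, at the cost of losing 
Boilerpipe for HTML pages as well.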

Best,
Sebastian


On 3/18/19 11:49 AM, hany.n...@hsbc.com.INVALID wrote:
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __ 
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
>> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
>>
>> Is it correct?, should I change anywhere else?
>>
>>
>> Kind regards,
>> Hany Shehata
>> Enterprise Engineer
>> Green Six Sigma Certified
>> Solutions Architect, Marketing and Communications IT Corporate 
>> Functions | HSBC Operations, Services and Technology (HOST) ul. 
>> Kapelanka 42A, 30-347 Kraków, Poland 
>> __
>>
>> Tie line: 7148 7689 4698
>> External: +48 123 42 0698
>> Mobile: +48 723 680 278
>> E-mail: hany.n...@hsbc.com
>> __________
>> Protect our environment - please only print this if you have to!
>>
>>
>> -Original Message-
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: 14 March 2019 10:59
>> To: user@nutch.apache.org
>> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
>> no choice, either skip large files, or increase memory.
>>
>> Regards,
>> Markus
>>
>>  
>>  
>> -Original message-
>>> From:hany.n...@hsbc.com.INVALID 
>>> Sent: Thursday 14th March 2019 10:44
>>> To: user@nutch.apache.org
>>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Hello,
>>>
>>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
>>> trying to parse pdfs that includes 3500 pages.
>>>
>>> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
>>> problem
>>>
>>> Please advise
>>>
>>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
>>> http://domain/-/media/files/attachments/common/voting_disclosure_2014
>>> _ q2.pdf with org.apache.nutch.parse.tika.TikaParser
>>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
>>> overhead limit exceeded
>>> at 
>>> java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>>> at 
>>> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>&

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi,

I found the root cause, and it is not related to the JVM heap size.

The problem parsing these PDFs happens when I set the Tika extractor to 
boilerpipe.

The Boilerpipe article extractor works perfectly with other PDFs and pages; 
when I disable it, Tika is able to parse and index these PDFs.

Any suggestion/help?

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 14 March 2019 13:06
To: user@nutch.apache.org
Subject: Re: OutOfMemoryError: GC overhead limit exceeded

Hi,

if running in local mode, the heap size is better passed via environment 
variables to bin/nutch, cf.

# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#   Default is 1000.
#
#   NUTCH_OPTS  Extra Java runtime options.
#   Multiple options must be separated by white space.

In distributed mode, please read the Hadoop docs about mapper/reducer memory 
and Java heap space.

Best,
Sebastian

On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> 
> Is it correct?, should I change anywhere else?
> 
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 14 March 2019 10:59
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
> choice, either skip large files, or increase memory.
> 
> Regards,
> Markus
> 
>  
>  
> -Original message-----
>> From:hany.n...@hsbc.com.INVALID 
>> Sent: Thursday 14th March 2019 10:44
>> To: user@nutch.apache.org
>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello,
>>
>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
>> trying to parse pdfs that includes 3500 pages.
>>
>> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
>> problem
>>
>> Please advise
>>
>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
>> http://domain/-/media/files/attachments/common/voting_disclosure_2014
>> _ q2.pdf with org.apache.nutch.parse.tika.TikaParser
>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
>> overhead limit exceeded
>> at 
>> java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>> at 
>> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>> at 
>> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>> at 
>> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>> at 
>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>> at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>> at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> java.util.co

Re: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread Sebastian Nagel
Hi,

if running in local mode, the heap size is better passed via environment 
variables to bin/nutch, cf.

# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#   Default is 1000.
#
#   NUTCH_OPTS  Extra Java runtime options.
#   Multiple options must be separated by white space.
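
For local mode that boils down to something like this minimal sketch (the heap 
size and segment path are only examples):

    export NUTCH_HEAPSIZE=4000                      # heap for bin/nutch in MB, default 1000
    bin/nutch parse crawl/segments/20190314120000   # segment path is illustrative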

In distributed mode, please read the Hadoop docs about mapper/reducer memory and
Java heap space.
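
A rough sketch of the distributed-mode equivalent, assuming Hadoop 2.x property 
names and that the job accepts the usual -D generic options:

    # property names are the standard Hadoop 2.x ones; values are illustrative
    bin/nutch parse -Dmapreduce.map.memory.mb=4096 \
                    -Dmapreduce.map.java.opts=-Xmx3584m \
                    crawl/segments/20190314120000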

Best,
Sebastian

On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> 
> Is it correct?, should I change anywhere else?
> 
> 
> Kind regards, 
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT 
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __ 
> 
> Tie line: 7148 7689 4698 
> External: +48 123 42 0698 
> Mobile: +48 723 680 278 
> E-mail: hany.n...@hsbc.com 
> __ 
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: 14 March 2019 10:59
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
> choice, either skip large files, or increase memory.
> 
> Regards,
> Markus
> 
>  
>  
> -Original message-----
>> From:hany.n...@hsbc.com.INVALID 
>> Sent: Thursday 14th March 2019 10:44
>> To: user@nutch.apache.org
>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello,
>>
>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
>> trying to parse pdfs that includes 3500 pages.
>>
>> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
>> problem
>>
>> Please advise
>>
>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
>> http://domain/-/media/files/attachments/common/voting_disclosure_2014_
>> q2.pdf with org.apache.nutch.parse.tika.TikaParser
>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
>> overhead limit exceeded
>> at 
>> java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>> at 
>> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>> at 
>> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>> at 
>> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>> at 
>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>> at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>> at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at 
>> org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
>> at 
>> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>> at 
>> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>> at 
>> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>> at 
>> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>> at 
>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>> at 
>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:1

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread hany . nasr
I'm changing mapred.child.java.opts=-Xmx1500m in the crawl bash file.

Is that correct? Should I change it anywhere else?


Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: 14 March 2019 10:59
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
choice: either skip large files or increase memory.

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Thursday 14th March 2019 10:44
> To: user@nutch.apache.org
> Subject: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello,
> 
> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> trying to parse pdfs that includes 3500 pages.
> 
> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
> problem
> 
> Please advise
> 
> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
> http://domain/-/media/files/attachments/common/voting_disclosure_2014_
> q2.pdf with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> at 
> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at 
> org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com<mailto:hany.n...@hsbc.com>
> __
> Protect our en

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread Markus Jelsma
Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
choice: either skip large files or increase memory.
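
For the first option, a sketch using standard Nutch properties: cap the fetched 
content size so oversized documents are truncated at fetch time and, with 
parser.skip.truncated left at its default, skipped by the parser:

    # value and segment path are illustrative
    bin/nutch fetch -Dhttp.content.limit=1048576 crawl/segments/20190308050000

For the second, raise NUTCH_HEAPSIZE in local mode or the mapper heap in 
distributed mode.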

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Thursday 14th March 2019 10:44
> To: user@nutch.apache.org
> Subject: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello,
> 
> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> trying to parse pdfs that includes 3500 pages.
> 
> I increased the JVM RAM to 1500MB; however, I'm still facing the same problem
> 
> Please advise
> 
> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
> http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf 
> with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> at 
> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at 
> org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com<mailto:hany.n...@hsbc.com>
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 
> -
> SAVE PAPER - THINK BEFORE YOU PRINT!
> 
> This E-mail is confidential.  
> 
> It may also be legally privileged. If you are not the addressee you may not 
> copy,
> forward, disclose or use any part of it. If you have received this message in 
> error,
> please delete it and all copies from your system and notify the sender 
> immediately by
> return E-mail.
> 
> Internet communications cannot be guaranteed to be timely secure, error or 
> virus-free.
> The sender does not accept liability for any errors or omissions.
> 


OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread hany . nasr
Hello,

I'm facing an OutOfMemoryError: GC overhead limit exceeded exception while trying 
to parse PDFs that include 3500 pages.

I increased the JVM heap to 1500MB; however, I'm still facing the same problem.

Please advise.

2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf with org.apache.nutch.parse.tika.TikaParser
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
        at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
        at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!



-
SAVE PAPER - THINK BEFORE YOU PRINT!

This E-mail is confidential.  

It may also be legally privileged. If you are not the addressee you may not copy, 
forward, disclose or use any part of it. If you have received this message in error, 
please delete it and all copies from your system and notify the sender immediately 
by return E-mail.

Internet communications cannot be guaranteed to be timely secure, error or 
virus-free. The sender does not accept liability for any errors or omissions.