RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi, the Jira ticket is created: https://issues.apache.org/jira/browse/NUTCH-2703 I'm able to crawl the website, including these huge PDFs, with a 500 MB JVM heap as long as Boilerpipe is disabled. Enabling Boilerpipe forced me to increase the JVM heap to 8500 MB. I hope this bug can be fixed in Nutch 1.16. Kind regards,

RE: Increasing the number of reducer in UpdateHostDB

2019-03-18 Thread Markus Jelsma
Hello Suraj, You can safely increase the number of reducers for UpdateHostDB to as high a value as you like. Regards, Markus
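For reference, a minimal sketch of one way to request more reducers, assuming UpdateHostDb does not hard-code its reducer count and, like other Nutch tools run through Hadoop's ToolRunner, accepts the generic -D option with the standard property mapreduce.job.reduces (the value 8 is an arbitrary example):

    __bin_nutch updatehostdb -D mapreduce.job.reduces=8 \
      -crawldb "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb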

Re: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Sebastian Nagel
Hi, good point. Maybe we should implement a limit on the usage of Boilerpipe:
- either by MIME type (only HTML types); I doubt that Boilerpipe has been implemented for any formats except HTML
- or by document size (or the size of the DOM tree)
Please open a Jira issue to implement this.
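A minimal sketch of the kind of guard described above, with hypothetical class, field, and limit names (nothing here is existing Nutch code): only run Boilerpipe for (X)HTML MIME types and skip documents over a size cap.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    /** Hypothetical guard: limit Boilerpipe to (X)HTML and modest sizes. */
    public class BoilerpipeGuard {
      // The only MIME types Boilerpipe is designed to handle.
      private static final Set<String> HTML_TYPES =
          new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml"));
      // Assumed cap on raw content size in bytes; tune to your heap.
      private static final long MAX_CONTENT_BYTES = 4L * 1024 * 1024;

      public static boolean useBoilerpipe(String mimeType, long contentBytes) {
        return HTML_TYPES.contains(mimeType) && contentBytes <= MAX_CONTENT_BYTES;
      }
    }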

RE: Increasing the number of reducer in UpdateHostDB

2019-03-18 Thread Suraj Singh
Thank you, Markus.

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Markus Jelsma
Hello Hany, If you deal with large PDF files and you get an OOM with this stack trace, it is highly unlikely to be due to Boilerpipe being active. Boilerpipe does not run until PDFBox is finished, so you should really increase the heap. Of course, to answer the question, Boilerpipe should not run on PDFs in the first place.

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi, Is there any workaround for now to exclude PDFs from Boilerpipe processing? Kind regards, Hany
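Until such a MIME-type limit exists, the only straightforward workaround is disabling the Boilerpipe extractor globally (at the cost of losing Boilerpipe for HTML pages as well). A sketch of the relevant nutch-site.xml override, using the standard tika.extractor property:

    <property>
      <name>tika.extractor</name>
      <value>none</value>
    </property>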

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hello Markus, I am able to parse these PDFs without increasing the heap if the Tika extractor is set to none. I did increase the heap with Boilerpipe enabled, but it didn't work: it gave me "failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content", then an OOM. Kind regards,

Increasing the number of reducer in UpdateHostDB

2019-03-18 Thread Suraj Singh
Hi All, Can I increase the number of reducers in the UpdateHostDB step? Currently it is running with 1 reducer. Will it impact the crawl in any way? The current command in the crawl script is: __bin_nutch updatehostdb -crawldb "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb Can I update it to use more reducers?

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi, I found the root cause, and it is not related to the JVM heap size. The problem parsing these PDFs happens when I set the Tika extractor to boilerpipe. The Boilerpipe article extractor works perfectly with other PDFs and pages; when I disable it, Tika is able to parse and index these PDFs.
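For context, setting the Tika extractor to boilerpipe corresponds to configuration along these lines in nutch-site.xml; the property names are the ones shipped in nutch-default.xml for the parse-tika plugin, but verify them against your Nutch version:

    <property>
      <name>tika.extractor</name>
      <value>boilerpipe</value>
    </property>
    <property>
      <name>tika.extractor.boilerpipe.algorithm</name>
      <value>ArticleExtractor</value>
    </property>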