Hi,
JIRA Ticket is created: https://issues.apache.org/jira/browse/NUTCH-2703
I'm able to crawl the website and these huge PDFs with a 500MB JVM heap without
Boilerpipe.
Enabling Boilerpipe forced me to increase the JVM heap to 8500MB.
Hope this bug can be fixed in Nutch 1.16.
Kind regards,
Hello Suraj,
You can safely increase the number of reducers for UpdateHostDB to as high as
you like.
Regards,
Markus
-----Original Message-----
> From: Suraj Singh
> Sent: Monday, 18 March 2019 11:41
> To: user@nutch.apache.org
> Subject: Increasing the number of reducer in UpdateHostDB
>
>
Hi,
Good point.
Maybe we should implement a limit on the usage of Boilerpipe:
- either by MIME type (only HTML types) - I doubt that Boilerpipe has been
implemented for any format other than HTML
- or by document size (or the size of the DOM tree)
Please open a Jira issue to implement this.
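The proposed MIME-type limit could look roughly like the following sketch. The class and method names here are illustrative only, not Nutch's actual TikaParser internals; the point is simply that content would only be handed to Boilerpipe when its MIME type is an HTML variant:

```java
public class BoilerpipeGuard {

    // Hypothetical guard for the NUTCH-2703 proposal: run Boilerpipe only
    // for HTML-like MIME types, so large PDFs bypass it entirely.
    static boolean useBoilerpipe(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        String mt = mimeType.toLowerCase();
        // Accept plain HTML and XHTML, including types carrying parameters
        // such as "text/html; charset=UTF-8".
        return mt.startsWith("text/html")
            || mt.startsWith("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(useBoilerpipe("application/pdf"));          // false
        System.out.println(useBoilerpipe("text/html; charset=UTF-8")); // true
    }
}
```

A size-based limit (the second suggestion above) would be a similar check on content length or DOM node count before invoking the extractor.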
But you
Thank you Markus.
-----Original Message-----
From: Markus Jelsma
Sent: Monday, 18 March 2019 11:49
To: user@nutch.apache.org
Subject: RE: Increasing the number of reducer in UpdateHostDB
Hello Suraj,
You can safely increase the number of reducers for UpdateHostDB to as high as
you like.
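As a sketch of how that could be done (assuming the crawl script's `__bin_nutch` wrapper, and that `updatehostdb` accepts Hadoop's generic options, as Nutch tools run through ToolRunner do), the reducer count can be passed with `-D` before the tool arguments; the value 8 is an arbitrary example, not a recommendation:

```shell
# Sketch: raise the reducer count for UpdateHostDB via Hadoop's generic
# -D option. Generic options must precede the tool-specific arguments.
__bin_nutch updatehostdb -D mapreduce.job.reduces=8 \
  -crawldb "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb
```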
Hello Hany,
If you deal with large PDF files and you get an OOM with this stack trace, it
is highly unlikely to be caused by Boilerpipe being active. Boilerpipe does not
run before PDFBox has finished, so you should really increase the heap.
Of course, to answer the question, Boilerpipe should not run
Hi,
Is there any workaround for now to exclude pdfs from the usage of boilerpipe?
Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka
Hello Markus,
I am able to parse these PDFs without increasing the heap if the Tika extractor
is set to none.
I did increase the heap with Boilerpipe enabled, but it didn't work: I got
"failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully
parse content", and then an OOM.
Kind regards,
Hi All,
Can I increase the number of reducers in the UpdateHostDB step? Currently it is
running with 1 reducer.
Will it impact the crawling in any way?
Current command in crawl script:
__bin_nutch updatehostdb -crawldb "$CRAWL_PATH"/crawldb -hostdb
"$CRAWL_PATH"/hostdb
Can I update it to:
Hi,
I found the root cause and it is not related to the JVM heap size.
The problem parsing these PDFs happens when I set the Tika extractor to
boilerpipe.
The Boilerpipe article extractor works perfectly with other PDFs and pages;
when I disable it, Tika is able to parse and index these PDFs.
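For reference, the switch being toggled here lives in nutch-site.xml. Setting `tika.extractor` to `none` disables Boilerpipe globally; there is currently no per-MIME-type setting, which is what NUTCH-2703 asks for:

```xml
<!-- nutch-site.xml: disable Boilerpipe for all documents.
     Valid values for tika.extractor are "boilerpipe" or "none". -->
<property>
  <name>tika.extractor</name>
  <value>none</value>
</property>
```

When Boilerpipe is enabled, the related `tika.extractor.boilerpipe.algorithm` property selects the extractor variant (e.g. ArticleExtractor).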