Anything larger than the default http.content.limit. I'm crawling an internal server and we have some large files, which is why I had increased the heap size to 8G. When I run it locally with a 1G heap and http.content.limit set to -1, the fetch completes successfully, though it throws OOM errors for the large files. However, when I do the same fetch on the server with the heap set to 8G, I get the errors I mentioned above. Even when I drop the heap back to 1G I still get the errors.
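
For reference, this is roughly what the override looks like in conf/nutch-site.xml for the -1 run (a sketch, not a copy of my exact file; -1 disables truncation, and if I remember right the shipped default in nutch-default.xml is 65536 bytes):

  <!-- sketch: content-limit override used for the unlimited run -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Max bytes to download per page; -1 means no truncation.</description>
  </property>

The heap size isn't set in this file; with a stock bin/nutch script it would come from the NUTCH_HEAPSIZE environment variable (in MB).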
On Tue, Apr 23, 2013 at 11:49 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> can you please give examples of the files which were truncated?
> thank you
> Lewis
>
> On Tuesday, April 23, 2013, Bai Shen <baishen.li...@gmail.com> wrote:
> > I just set http.content.limit back to the default and my fetch completed
> > successfully on the server. However, it truncated several of my files.
> >
> > Also, my server is running Nutch in local mode as well. I don't have a
> > hadoop cluster.
> >
> > On Mon, Apr 22, 2013 at 3:39 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> > > > It's not the documents AFAIK. I'm crawling the same server and it works on
> > > > my local machine, but not on the server with more ram. I get the OOM
> > > > errors on both, but don't have the aborting hung threads.
> > > There could be a couple of reasons why the timeout happens on the server
> > > but not on the local machine.
> > >
> > > Can you try to limit http.content.limit and try again?
> > >
> > > On 04/22/2013 09:17 PM, Bai Shen wrote:
> > > > Nutch 2.1
> > > > bin/nutch fetch -all
> > > > No depth
> > > > No topN. I'm only pulling around 600 documents at the current round.
> > > > http.content.limit is -1
> > > > fetcher.parse is the default
> > > > HBase
> > > >
> > > > It's not the documents AFAIK. I'm crawling the same server and it works on
> > > > my local machine, but not on the server with more ram. I get the OOM
> > > > errors on both, but don't have the aborting hung threads. Also, the fetch
> > > > on the local machine completes. The one on the server does not. They're
> > > > both running local mode. I literally copied the directories up to the
> > > > server and then increased the heap size to 8G. So I don't know what
> > > > configuration difference there could be.
> > > >
> > > > On Mon, Apr 22, 2013 at 2:58 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > more information would be useful:
> > > > > - exact Nutch version (2.?)
> > > > > - how Nutch is called (eg, via bin/crawl)
> > > > > - details of the configuration, esp.
> > > > >   -depth
> > > > >   -topN
> > > > >   http.content.limit
> > > > >   fetcher.parse
> > > > > - storage back-end
> > > > >
> > > > > In general, something is wrong. Maybe, some oversized documents
> > > > > are crawled. But even for a large PDF (several MB) 2GB heap size
> > > > > should be enough.
> > > > >
> > > > > You can try to identify the documents/URLs which cause the hang-up:
> > > > > http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads
> > > > >
> > > > > Also keep track of:
> > > > > https://issues.apache.org/jira/browse/NUTCH-1182
> > > > >
> > > > > Sebastian
> > > > >
> > > > > On 04/22/2013 08:18 PM, Bai Shen wrote:
> > > > > > I'm crawling a local server. I have Nutch 2 working on a local machine
> > > > > > with the default 1G heap size. I got several OOM errors, but the fetch
> > > > > > eventually finishes.
> > > > > >
> > > > > > In order to get rid of the OOM errors, I moved everything to a machine with
> > > > > > more memory and increased the heap size to 8G. However, I'm still getting
> > > > > > the OOM errors and now I'm having Nutch abort hung threads. After it
> > > > > > aborts the hung threads, Nutch itself hangs.
> > > > > >
> > > > > > Any idea what could be causing this or what to look at? hadoop.log shows
> > > > > > nothing after the "Aborting with 1 hung threads." message.
> > > > > >
> > > > > > Thanks.
>
> --
> *Lewis*