What would your recommendation be to avoid having the whole fetcher hang? I
know I've previously seen it work correctly even with hung threads, but I'm
not sure what was different then.
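For reference, if I've read Sebastian's explanation below correctly, the
hung-thread abort fires at mapred.task.timeout / fetcher.threads.timeout.divisor
(10 min / 2 = 5 min by default), so a rough nutch-site.xml sketch along these
lines should at least move that timeout (the values here are only an
illustration, not tested recommendations):

  <!-- Illustrative only: raises the hung-thread timeout from 5 to 15 minutes. -->
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value> <!-- 30 minutes, in milliseconds -->
  </property>
  <property>
    <name>fetcher.threads.timeout.divisor</name>
    <value>2</value> <!-- hung-thread timeout = mapred.task.timeout / this divisor -->
  </property>

That said, as noted below, a single fetch should never take 5 minutes, so
raising the timeout only papers over whatever is keeping those requests open.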
Well, I still ended up having to set a content limit, which is why I'm
wondering how the Nutch Gora integration works. I didn't see a lot of
documentation on it. So far Nutch seems to be running okay with the changes I
made. However, I left it crawling overnight and came back to find that HBase
had maxed out its memory. Any suggestions for dealing with that? Thanks.

On Wed, Apr 24, 2013 at 5:17 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> > The problem with the hung threads aborting isn't that they're hung. It's
> > that the whole fetch hangs with no error. The process never completes.
> Yes, you are right. The threads are still alive, see NUTCH-1182.
> And the fetcher job is not finished after the fetcher threads have finished:
> fetched data has to be written to disk/hdfs/storage.
>
> > that Nutch would try and store the entire segment in memory.
> For segments and 1.x: docs in process are held in memory, but all completely
> fetched documents are written ("spilled") to the local disk as soon as the
> output buffer is filled up to a certain threshold. At the end, the local
> data is sorted and written into the final segment.
>
> > Has this changed with the move to HBase? Do the files get pushed as soon
> > as they're fetched or does that happen at the end?
> Don't know, that's a question for the Gora experts.
>
> > I was finally able to get the fetch to complete by setting the Nutch heap
> > to 4GB and the HBase heap to 4GB.
> A heap size 4 times the document size doesn't seem that much ;-)
>
> On 04/24/2013 01:34 PM, Bai Shen wrote:
> > It doesn't take that long on my local machine. It's only when I run it on
> > the server that I get the hung threads abort. The problem with the hung
> > threads aborting isn't that they're hung. It's that the whole fetch hangs
> > with no error. The process never completes.
> >
> > As for docs, I know I have at least one that's 1 GB in size, and quite a
> > few in the multiple-MB range.
> >
> > I was finally able to get the fetch to complete by setting the Nutch heap
> > to 4GB and the HBase heap to 4GB. I think that was part of my initial
> > problem. I had increased the Nutch heap and forgotten about HBase.
> >
> > However, I'm still getting OOM errors with some of the documents. I know
> > previously that Nutch would try and store the entire segment in memory.
> > Has this changed with the move to HBase? Do the files get pushed as soon
> > as they're fetched or does that happen at the end?
> >
> > Thanks.
> >
> > On Tue, Apr 23, 2013 at 3:52 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> Hi,
> >>
> >> if fetcher.parse is the default (=false), the OOM is caused
> >> by the fetcher itself (not while parsing). Because document content
> >> is buffered as byte[] (almost no memory overhead):
> >> - either there are some really large docs (GBs)
> >> - or there are reasonably large docs (a few MBs)
> >>   and too many fetcher threads
> >>
> >> The other problem, the hung threads, also points to unreasonably large
> >> documents. The hung threads appear after a timeout of 5 min:
> >>   mapred.task.timeout / fetcher.threads.timeout.divisor
> >>   10 min. / 2 = 5 min.
> >> You can try to enlarge these values, but a single fetch should never take
> >> 5 min.
> >>
> >> Sebastian
> >>
> >> On 04/23/2013 06:17 PM, Bai Shen wrote:
> >>> Anything larger than the default http.content.limit.
> >>>
> >>> I'm crawling an internal server and we have some large files. That's
> >>> why I had increased the heap size to 8G.
> >>> When I run it locally with a 1G heap and a -1 http.content.limit, the
> >>> fetch successfully completes, throwing OOM errors for the large files.
> >>> However, when I do the same fetch on the server with the heap set to 8G,
> >>> I get the errors I mentioned above. Even when dropping the heap back to
> >>> 1G I still get errors.
> >>>
> >>> On Tue, Apr 23, 2013 at 11:49 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> >>>
> >>>> can you please give examples of the files which were truncated?
> >>>> thank you
> >>>> Lewis
> >>>>
> >>>> On Tuesday, April 23, 2013, Bai Shen <baishen.li...@gmail.com> wrote:
> >>>>> I just set http.content.limit back to the default and my fetch completed
> >>>>> successfully on the server. However, it truncated several of my files.
> >>>>>
> >>>>> Also, my server is running Nutch in local mode as well. I don't have a
> >>>>> Hadoop cluster.
> >>>>>
> >>>>> On Mon, Apr 22, 2013 at 3:39 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >>>>>
> >>>>>>> It's not the documents AFAIK. I'm crawling the same server and it
> >>>>>>> works on my local machine, but not on the server with more RAM. I get
> >>>>>>> the OOM errors on both, but don't have the aborting hung threads.
> >>>>>> There could be a couple of reasons why the timeout happens on the
> >>>>>> server but not on the local machine.
> >>>>>>
> >>>>>> Can you try to limit http.content.limit and try again?
> >>>>>>
> >>>>>> On 04/22/2013 09:17 PM, Bai Shen wrote:
> >>>>>>> Nutch 2.1
> >>>>>>> bin/nutch fetch -all
> >>>>>>> No depth
> >>>>>>> No topN. I'm only pulling around 600 documents at the current round.
> >>>>>>> http.content.limit is -1
> >>>>>>> fetcher.parse is the default
> >>>>>>> HBase
> >>>>>>>
> >>>>>>> It's not the documents AFAIK. I'm crawling the same server and it
> >>>>>>> works on my local machine, but not on the server with more RAM. I get
> >>>>>>> the OOM errors on both, but don't have the aborting hung threads.
> >>>>>>> Also, the fetch on the local machine completes. The one on the server
> >>>>>>> does not. They're both running local mode. I literally copied the
> >>>>>>> directories up to the server and then increased the heap size to 8G.
> >>>>>>> So I don't know what configuration difference there could be.
> >>>>>>>
> >>>>>>> On Mon, Apr 22, 2013 at 2:58 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> more information would be useful:
> >>>>>>>> - exact Nutch version (2.?)
> >>>>>>>> - how Nutch is called (e.g., via bin/crawl)
> >>>>>>>> - details of the configuration, esp.
> >>>>>>>>     -depth
> >>>>>>>>     -topN
> >>>>>>>>     http.content.limit
> >>>>>>>>     fetcher.parse
> >>>>>>>> - storage back-end
> >>>>>>>>
> >>>>>>>> In general, something is wrong. Maybe some oversized documents
> >>>>>>>> are crawled. But even for a large PDF (several MB) a 2GB heap size
> >>>>>>>> should be enough.
> >>>>>>>>
> >>>>>>>> You can try to identify the documents/URLs which cause the hang-up:
> >>>>>>>> http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads
> >>>>>>>>
> >>>>>>>> Also keep track of:
> >>>>>>>> https://issues.apache.org/jira/browse/NUTCH-1182
> >>>>>>>>
> >>>>>>>> Sebastian
> >>>>>>>>
> >>>>>>>> On 04/22/2013 08:18 PM, Bai Shen wrote:
> >>>>>>>>> I'm crawling a local server.
> >>>>>>>>> I have Nutch 2 working on a local machine with the default 1G heap
> >>>>>>>>> size. I got several OOM errors, but the fetch eventually finishes.
> >>>>>>>>>
> >>>>>>>>> In order to get rid of the OOM errors, I moved everything to a
> >>>>>>>>> machine with more memory and increased the heap size to 8G. However,
> >>>>>>>>> I'm still getting the OOM errors and now I'm having Nutch abort hung
> >>>>>>>>> threads. After it aborts the hung threads, Nutch itself hangs.
> >>>>>>>>>
> >>>>>>>>> Any idea what could be causing this or what to look at? hadoop.log
> >>>>>>>>> shows nothing after the "Aborting with 1 hung threads." message.
> >>>>>>>>>
> >>>>>>>>> Thanks.
> >>>>
> >>>> --
> >>>> *Lewis*
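A note on http.content.limit, since it comes up repeatedly above: -1 buffers
each document whole in the fetcher (hence the OOM errors and hung threads on
the GB-sized files), while the default of 64 kB truncates anything larger. A
middle ground is to cap it at a few megabytes in nutch-site.xml; a minimal
sketch, with the 10 MB value chosen purely as an example:

  <property>
    <name>http.content.limit</name>
    <!-- Maximum number of bytes downloaded per document over HTTP;
         -1 means unlimited. 10485760 bytes = 10 MB. -->
    <value>10485760</value>
  </property>

The same idea applies to file.content.limit and ftp.content.limit if those
protocols are in use.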