What would your recommendation be to avoid having the whole fetcher hang?
I know I've previously seen it work correctly with hung threads, but I'm
not sure what was different then.
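
In case it helps frame the question, this is what I was planning to try in
nutch-site.xml to give slow fetches more headroom before the hung-thread
abort kicks in (the values are only a guess on my part, not something I've
verified):

  <!-- nutch-site.xml: values are only a guess, not verified -->
  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>
    <description>Task timeout in ms; 30 min. instead of the default 10 min.</description>
  </property>
  <property>
    <name>fetcher.threads.timeout.divisor</name>
    <value>2</value>
    <description>Hung-thread timeout = mapred.task.timeout / this divisor.</description>
  </property>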

Well, I still ended up having to set a content limit, which is why I'm
wondering how the Nutch Gora integration works.  I didn't see a lot of
documentation on it.
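
(For reference, the limit goes into nutch-site.xml roughly like the snippet
below; the 10 MB value is only an example, not necessarily what I ended up
with.)

  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
    <description>Maximum bytes downloaded per document (10 MB here); -1 means no limit.</description>
  </property>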

So far Nutch seems to be running okay with the changes I made.  However, I
left it crawling overnight and came back to find that HBase had maxed out
its memory.  Any suggestions for dealing with that?
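
The knobs I've been thinking about, assuming the property names below are
actually the right ones for this HBase version, are the block cache and
memstore fractions in hbase-site.xml, along these lines:

  <!-- hbase-site.xml: my working assumption, not verified advice;
       shrink the block cache and memstore so they use a smaller
       fraction of the region server heap (names may differ by version) -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.2</value>
  </property>
  <property>
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.3</value>
  </property>

Or is just giving the region server a bigger heap the only real answer here?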

Thanks.


On Wed, Apr 24, 2013 at 5:17 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> > The problem with the hung threads aborting isn't that they're hung.
> > It's that the whole fetch hangs with no error.  The process never
> > completes.
> Yes, you are right. The threads are still alive, see NUTCH-1182.
> And the fetcher job is not finished after fetcher threads have finished:
> fetched data has to be written to disk/hdfs/storage.
>
> > that Nutch would try and store the entire segment in memory.
> For segments and 1.x: docs in process are held in memory, but all
> completely fetched documents are written ("spilled") to the local disk as
> soon as the output buffer is filled up to a certain threshold. At the end,
> the local data is sorted and written into the final segment.
>
> > Has this changed with the move to HBase?
> > Do the files get pushed as soon
> > as they're fetched or does that happen at the end?
> Don't know, that's a question for the Gora experts.
>
> > I was finally able to get the fetch to complete by setting the Nutch heap
> > to 4GB and the HBase heap to 4GB.
> A heap size 4 times the document size doesn't seem that much ;-)
>
> On 04/24/2013 01:34 PM, Bai Shen wrote:
> > It doesn't take that long on my local machine.  It's only when I run it
> > on the server that I get the hung threads abort.  The problem with the
> > hung threads aborting isn't that they're hung.  It's that the whole fetch
> > hangs with no error.  The process never completes.
> >
> > As for docs, I know I have at least one that's 1 GB in size, and quite a
> > few in the multiple MB size.
> >
> > I was finally able to get the fetch to complete by setting the Nutch heap
> > to 4GB and the HBase heap to 4GB.  I think that was part of my initial
> > problem.  I had increased the Nutch heap and forgotten about HBase.
> >
> > However, I'm still getting OOM errors with some of the documents.  I
> > know that previously Nutch would try to store the entire segment in
> > memory.  Has this changed with the move to HBase?  Do the files get
> > pushed as soon as they're fetched or does that happen at the end?
> >
> > Thanks.
> >
> >
> > On Tue, Apr 23, 2013 at 3:52 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> Hi,
> >>
> >> if fetcher.parse is the default (=false) the OOM is caused
> >> by the fetcher itself (not while parsing), because document content
> >> is buffered as byte[] (almost no memory overhead):
> >> - either there are some really large docs (GBs)
> >> - or there are reasonably large docs (a few MBs)
> >>   and too many fetcher threads
> >>
> >> The other problem, the hung threads, also points to unreasonably large
> >> documents.
> >> The hung threads appear after a timeout of 5 min.:
> >>   mapred.task.timeout / fetcher.threads.timeout.divisor
> >>   10 min. / 2 = 5 min.
> >> You can try to enlarge these values, but a single fetch should never
> >> take 5 min.
> >>
> >> Sebastian
> >>
> >> On 04/23/2013 06:17 PM, Bai Shen wrote:
> >>> Anything larger than the default http.content.limit.
> >>>
> >>> I'm crawling an internal server and we have some large files.  That's
> >>> why I had increased the heap size to 8G.  When I run it locally with a
> >>> 1G heap and a -1 http.content.limit the fetch successfully completes,
> >>> throwing OOM errors for the large files.  However, when I do the same
> >>> fetch on the server with the heap set to 8G, I get the errors I
> >>> mentioned above.  Even when dropping the heap back to 1G I still get
> >>> errors.
> >>>
> >>>
> >>> On Tue, Apr 23, 2013 at 11:49 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> >>>
> >>>> can you please give examples of the files which were truncated?
> >>>> thank you
> >>>> Lewis
> >>>>
> >>>> On Tuesday, April 23, 2013, Bai Shen <baishen.li...@gmail.com> wrote:
> >>>>> I just set http.content.limit back to the default and my fetch
> >>>>> completed successfully on the server.  However, it truncated several
> >>>>> of my files.
> >>>>>
> >>>>> Also, my server is running Nutch in local mode as well.  I don't have
> >>>>> a hadoop cluster.
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 22, 2013 at 3:39 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >>>>>
> >>>>>>> It's not the documents AFAIK.  I'm crawling the same server and it
> >>>>>>> works on my local machine, but not on the server with more RAM.  I
> >>>>>>> get the OOM errors on both, but don't have the aborting hung
> >>>>>>> threads.
> >>>>>> There could be a couple of reasons why the timeout happens on the
> >>>>>> server but not on the local machine.
> >>>>>>
> >>>>>> Can you try limiting http.content.limit and running the fetch again?
> >>>>>>
> >>>>>> On 04/22/2013 09:17 PM, Bai Shen wrote:
> >>>>>>> Nutch 2.1
> >>>>>>> bin/nutch fetch -all
> >>>>>>> No depth
> >>>>>>> No topN.  I'm only pulling around 600 documents at the current
> >>>>>>> round.
> >>>>>>> http.content.limit is -1
> >>>>>>> fetcher.parse is the default
> >>>>>>> HBase
> >>>>>>>
> >>>>>>> It's not the documents AFAIK.  I'm crawling the same server and it
> >>>>>>> works on my local machine, but not on the server with more RAM.  I
> >>>>>>> get the OOM errors on both, but don't have the aborting hung
> >>>>>>> threads.  Also, the fetch on the local machine completes.  The one
> >>>>>>> on the server does not.  They're both running local mode.  I
> >>>>>>> literally copied the directories up to the server and then increased
> >>>>>>> the heap size to 8G.  So I don't know what configuration difference
> >>>>>>> there could be.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Apr 22, 2013 at 2:58 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> more information would be useful:
> >>>>>>>> - exact Nutch version (2.?)
> >>>>>>>> - how Nutch is called (eg, via bin/crawl)
> >>>>>>>> - details of the configuration, esp.
> >>>>>>>>   -depth
> >>>>>>>>   -topN
> >>>>>>>>   http.content.limit
> >>>>>>>>   fetcher.parse
> >>>>>>>> - storage back-end
> >>>>>>>>
> >>>>>>>> In general, something is wrong. Maybe some oversized documents
> >>>>>>>> are crawled. But even for a large PDF (several MB) a 2GB heap size
> >>>>>>>> should be enough.
> >>>>>>>>
> >>>>>>>> You can try to identify the documents/URLs which cause the
> >>>>>>>> hang-up:
> >>>>>>>>
> >>>>>>>> http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads
> >>>>>>>>
> >>>>>>>> Also keep track of:
> >>>>>>>>  https://issues.apache.org/jira/browse/NUTCH-1182
> >>>>>>>>
> >>>>>>>> Sebastian
> >>>>>>>>
> >>>>>>>> On 04/22/2013 08:18 PM, Bai Shen wrote:
> >>>>>>>>> I'm crawling a local server.  I have Nutch 2 working on a local
> >>>>>>>>> machine with the default 1G heap size.  I got several OOM errors,
> >>>>>>>>> but the fetch eventually finishes.
> >>>>>>>>>
> >>>>>>>>> In order to get rid of the OOM errors, I moved everything to a
> >>>>>>>>> machine with more memory and increased the heap size to 8G.
> >>>>>>>>> However, I'm still getting the OOM errors and now I'm having Nutch
> >>>>>>>>> abort hung threads.  After it aborts the hung threads, Nutch itself
> >>>>>>>>> hangs.
> >>>>>>>>>
> >>>>>>>>> Any idea what could be causing this or what to look at?
> >>>>>>>>> hadoop.log shows nothing after the "Aborting with 1 hung threads."
> >>>>>>>>> message.
> >>>>>>>>>
> >>>>>>>>> Thanks.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> *Lewis*
> >>>>
> >>>
> >>
> >>
> >
>
>
