It doesn't take that long on my local machine.  It's only when I run it
on the server that I get the hung-threads abort.  The problem with the
hung threads aborting isn't that they're hung; it's that the whole fetch
then hangs with no error and the process never completes.

As for the docs, I know I have at least one that's 1 GB in size, and
quite a few in the multi-MB range.

I was finally able to get the fetch to complete by setting both the
Nutch heap and the HBase heap to 4 GB.  I think that was part of my
initial problem: I had increased the Nutch heap but forgotten about
HBase.
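
In case it helps anyone else, those heap settings amount to something
like this (assuming they are set the usual way, via the NUTCH_HEAPSIZE
environment variable that bin/nutch reads and HBASE_HEAPSIZE in
conf/hbase-env.sh; adjust if your install sets them differently):

  # Nutch heap, in MB, picked up by bin/nutch
  export NUTCH_HEAPSIZE=4096

  # HBase heap, in MB, set in $HBASE_HOME/conf/hbase-env.sh
  export HBASE_HEAPSIZE=4096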

However, I'm still getting OOM errors with some of the documents.  I
know that previously Nutch would try to store the entire segment in
memory.  Has this changed with the move to HBase?  Do the files get
pushed to HBase as soon as they're fetched, or does that happen at the
end?
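
For anyone hitting the same thing, the knobs mentioned in this thread
can all be overridden in conf/nutch-site.xml.  A rough sketch with
illustrative values (not recommendations):

  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation; the default is 65536 bytes -->
    <value>-1</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- fewer threads means fewer large documents buffered at once -->
    <value>5</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <!-- in ms; the hung-thread abort fires after
         mapred.task.timeout / fetcher.threads.timeout.divisor -->
    <value>1200000</value>
  </property>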

Thanks.


On Tue, Apr 23, 2013 at 3:52 PM, Sebastian Nagel
<wastl.na...@googlemail.com> wrote:

> Hi,
>
> if fetcher.parse is the default (=false), the OOM is caused
> by the fetcher itself (not while parsing), because document content
> is buffered as byte[] (with almost no memory overhead):
> - either there are some really large docs (GBs)
> - or there are reasonably large docs (a few MBs)
>   and too many fetcher threads
>
> The other problem, the hung threads, also points to unreasonably large
> documents.  The hung threads appear after a timeout of 5 min:
>   mapred.task.timeout / fetcher.threads.timeout.divisor
>   = 10 min / 2 = 5 min
> You can try to enlarge these values, but a single fetch should never
> take 5 min.
>
> Sebastian
>
> On 04/23/2013 06:17 PM, Bai Shen wrote:
> > Anything larger than the default http.content.limit.
> >
> > I'm crawling an internal server and we have some large files.  That's
> > why I had increased the heap size to 8G.  When I run it locally with
> > a 1G heap and a -1 http.content.limit, the fetch successfully
> > completes, throwing OOM errors for the large files.  However, when I
> > do the same fetch on the server with the heap set to 8G, I get the
> > errors I mentioned above.  Even when dropping the heap back to 1G I
> > still get errors.
> >
> >
> > On Tue, Apr 23, 2013 at 11:49 AM, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> >
> >> can you please give examples of the files which were truncated?
> >> thank you
> >> Lewis
> >>
> >> On Tuesday, April 23, 2013, Bai Shen <baishen.li...@gmail.com> wrote:
> >>> I just set http.content.limit back to the default and my fetch
> >>> completed successfully on the server.  However, it truncated
> >>> several of my files.
> >>>
> >>> Also, my server is running Nutch in local mode as well.  I don't have a
> >>> hadoop cluster.
> >>>
> >>>
> >>> On Mon, Apr 22, 2013 at 3:39 PM, Sebastian Nagel
> >>> <wastl.na...@googlemail.com> wrote:
> >>>
> >>>>> It's not the documents AFAIK.  I'm crawling the same server and
> >>>>> it works on my local machine, but not on the server with more
> >>>>> RAM.  I get the OOM errors on both, but don't have the aborting
> >>>>> hung threads.
> >>>> There could be a couple of reasons why the timeout happens on the
> >>>> server but not on the local machine.
> >>>>
> >>>> Can you try to limit http.content.limit and try again?
> >>>>
> >>>> On 04/22/2013 09:17 PM, Bai Shen wrote:
> >>>>> Nutch 2.1
> >>>>> bin/nutch fetch -all
> >>>>> No depth
> >>>>> No topN.  I'm only pulling around 600 documents in the current round.
> >>>>> http.content.limit is -1
> >>>>> fetcher.parse is the default
> >>>>> HBase
> >>>>>
> >>>>> It's not the documents AFAIK.  I'm crawling the same server and
> >>>>> it works on my local machine, but not on the server with more
> >>>>> RAM.  I get the OOM errors on both, but don't have the aborting
> >>>>> hung threads.  Also, the fetch on the local machine completes.
> >>>>> The one on the server does not.  They're both running local mode.
> >>>>> I literally copied the directories up to the server and then
> >>>>> increased the heap size to 8G.  So I don't know what configuration
> >>>>> difference there could be.
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 22, 2013 at 2:58 PM, Sebastian Nagel
> >>>>> <wastl.na...@googlemail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> more information would be useful:
> >>>>>> - exact Nutch version (2.?)
> >>>>>> - how Nutch is called (e.g., via bin/crawl)
> >>>>>> - details of the configuration, esp.
> >>>>>>   -depth
> >>>>>>   -topN
> >>>>>>   http.content.limit
> >>>>>>   fetcher.parse
> >>>>>> - storage back-end
> >>>>>>
> >>>>>> In general, something is wrong.  Maybe some oversized documents
> >>>>>> are being crawled.  But even for a large PDF (several MB), a 2GB
> >>>>>> heap should be enough.
> >>>>>>
> >>>>>> You can try to identify the documents/URLs which cause the hang-up:
> >>>>>>
> >>>>>> http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads
> >>>>>>
> >>>>>> Also keep track of:
> >>>>>>  https://issues.apache.org/jira/browse/NUTCH-1182
> >>>>>>
> >>>>>> Sebastian
> >>>>>>
> >>>>>> On 04/22/2013 08:18 PM, Bai Shen wrote:
> >>>>>>> I'm crawling a local server.  I have Nutch 2 working on a local
> >>>>>>> machine with the default 1G heap size.  I got several OOM
> >>>>>>> errors, but the fetch eventually finishes.
> >>>>>>>
> >>>>>>> In order to get rid of the OOM errors, I moved everything to a
> >>>>>>> machine with more memory and increased the heap size to 8G.
> >>>>>>> However, I'm still getting the OOM errors and now Nutch is
> >>>>>>> aborting hung threads.  After it aborts the hung threads, Nutch
> >>>>>>> itself hangs.
> >>>>>>>
> >>>>>>> Any idea what could be causing this or what to look at?
> >>>>>>> hadoop.log shows nothing after the "Aborting with 1 hung
> >>>>>>> threads." message.
> >>>>>>>
> >>>>>>> Thanks.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >> --
> >> *Lewis*
> >>
> >
>
>
