Anything larger than the default http.content.limit (65536 bytes) got truncated.

I'm crawling an internal server and we have some large files.  That's why I
had increased the heap size to 8G.  When I run the crawl locally with a 1G heap
and http.content.limit set to -1, the fetch completes successfully, though it
throws OOM errors for the large files.  However, when I do the same fetch on
the server with the heap set to 8G, I get the errors I mentioned above, and
even when I drop the heap back to 1G I still get them.
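
For reference, here's a minimal sketch of the two settings in question (the
values below are examples/assumptions, not what's actually in my config).
http.content.limit goes in conf/nutch-site.xml, e.g. to cap fetched content at
roughly 10 MB instead of disabling the limit with -1:

  <property>
    <name>http.content.limit</name>
    <!-- example cap of ~10 MB; -1 means no limit -->
    <value>10485760</value>
  </property>

And assuming the stock bin/nutch script, the heap comes from the NUTCH_HEAPSIZE
environment variable (in MB), so the local-mode run looks like:

  export NUTCH_HEAPSIZE=8000
  bin/nutch fetch -all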


On Tue, Apr 23, 2013 at 11:49 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> can you please give examples of the files which were truncated?
> thank you
> Lewis
>
> On Tuesday, April 23, 2013, Bai Shen <baishen.li...@gmail.com> wrote:
> > I just set http.content.limit back to the default and my fetch completed
> > successfully on the server.  However, it truncated several of my files.
> >
> > Also, my server is running Nutch in local mode.  I don't have a
> > Hadoop cluster.
> >
> >
> > On Mon, Apr 22, 2013 at 3:39 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> > It's not the documents AFAIK.  I'm crawling the same server and it works
> >> > on my local machine, but not on the server with more RAM.  I get the OOM
> >> > errors on both, but don't have the aborting hung threads.
> >> There could be a couple of reasons why the timeout happens on the server
> >> but not on the local machine.
> >>
> >> Can you limit http.content.limit and try again?
> >>
> >> On 04/22/2013 09:17 PM, Bai Shen wrote:
> >> > Nutch 2.1
> >> > bin/nutch fetch -all
> >> > No depth
> >> > No topN.  I'm only pulling around 600 documents at the current round.
> >> > http.content.limit is -1
> >> > fetcher.parse is the default
> >> > HBase
> >> >
> >> > It's not the documents AFAIK.  I'm crawling the same server and it works
> >> > on my local machine, but not on the server with more RAM.  I get the OOM
> >> > errors on both, but don't have the aborting hung threads.  Also, the
> >> > fetch on the local machine completes.  The one on the server does not.
> >> > They're both running local mode.  I literally copied the directories up
> >> > to the server and then increased the heap size to 8G.  So I don't know
> >> > what configuration difference there could be.
> >> >
> >> >
> >> > On Mon, Apr 22, 2013 at 2:58 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> more information would be useful:
> >> >> - exact Nutch version (2.?)
> >> >> - how Nutch is called (e.g., via bin/crawl)
> >> >> - details of the configuration, esp.
> >> >>   -depth
> >> >>   -topN
> >> >>   http.content.limit
> >> >>   fetcher.parse
> >> >> - storage back-end
> >> >>
> >> >> In general, something is wrong. Maybe some oversized documents
> >> >> are being crawled, but even for a large PDF (several MB) a 2GB heap
> >> >> should be enough.
> >> >>
> >> >> You can try to identify the documents/URLs which cause the hang-up:
> >> >>
> >> >> http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads
> >> >>
> >> >> Also keep track of:
> >> >>  https://issues.apache.org/jira/browse/NUTCH-1182
> >> >>
> >> >> Sebastian
> >> >>
> >> >> On 04/22/2013 08:18 PM, Bai Shen wrote:
> >> >>> I'm crawling a local server.  I have Nutch 2 working on a local machine
> >> >>> with the default 1G heap size.  I got several OOM errors, but the fetch
> >> >>> eventually finishes.
> >> >>>
> >> >>> In order to get rid of the OOM errors, I moved everything to a machine
> >> >>> with more memory and increased the heap size to 8G.  However, I'm still
> >> >>> getting the OOM errors and now I'm having Nutch abort hung threads.
> >> >>> After it aborts the hung threads, Nutch itself hangs.
> >> >>>
> >> >>> Any idea what could be causing this or what to look at?  hadoop.log
> >> >>> shows nothing after the "Aborting with 1 hung threads." message.
> >> >>>
> >> >>> Thanks.
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >>
> >
>
> --
> *Lewis*
>
