Hi,

It is *NOT* about the fetch. It is about the parse.

I am using Nutch 1.2 with 500 hosts. The fetcher is doing fine with the
default settings of 1 thread per host and 100 concurrent threads.

The bottleneck is the parse.
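
For reference, here is roughly what the parse-related part of my
nutch-site.xml looks like. The values are illustrative of my setup, and
parser.timeout may not be present in every 1.x release, so treat this as a
sketch rather than a recommendation:

  <!-- Only include the HTML parser plugin, since I parse HTML content only. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Abort parses that hang on malformed pages (value in seconds). -->
  <property>
    <name>parser.timeout</name>
    <value>30</value>
  </property>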

Regards,

On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi
<[email protected]>wrote:

> Hi!
>
> Which version of Nutch are you using?
>
> Can you please give some more details on your configuration, such as:
>
> i) How many threads are you using per queue?
> ii) Do the URLs you are crawling belong to a single host or to different
> hosts?
>
> There are a lot of things that come into play, AFAIK, when trying to find
> optimal or at least decent performance.
>
> If you are using one thread per queue and crawling a single host, it is
> going to take quite some time because of the politeness policy, which
> enforces a five-second gap between requests to the same host by default.
>
> These configurations can be changed; there has been discussion on the
> mailing list.
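>
> For illustration, the relevant properties live in nutch-site.xml and look
> roughly like this (the values shown are the defaults as I understand them;
> tune them to your politeness requirements):
>
>   <!-- Number of fetcher threads working on the same queue (host). -->
>   <property>
>     <name>fetcher.threads.per.queue</name>
>     <value>1</value>
>   </property>
>
>   <!-- Delay in seconds between successive requests to the same host. -->
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>5.0</value>
>   </property>
>
>   <!-- Total number of fetcher threads across all queues. -->
>   <property>
>     <name>fetcher.threads.fetch</name>
>     <value>10</value>
>   </property>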
>
> I had a similar problem where the fetcher was taking very long with a big
> topN, and I attained stability by changing the configuration. Please find
> the discussion at [1]:
>
>
> http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
>
> Hope this helps.
>
> On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]> wrote:
>
> > Good day folks,
> >
> > I have a question about benchmarking Nutch's ParseSegment step.
> >
> > I have separated the fetch and parse steps: the URLs are fetched first,
> > and then the fetched content for the current depth is parsed. I believe
> > that is the recommended approach for better performance as well.
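> >
> > Concretely, the cycle I run looks roughly like this (a sketch; the
> > segment path is a placeholder for whatever segment was just generated):
> >
> >   # Fetch without parsing (-noParsing defers parsing to a later step)
> >   bin/nutch fetch crawl/segments/20130308 -noParsing
> >
> >   # Parse the fetched content as a separate step (ParseSegment)
> >   bin/nutch parse crawl/segments/20130308
> >
> >   # Fold the parse results back into the crawldb
> >   bin/nutch updatedb crawl/crawldb crawl/segments/20130308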
> >
> > I am running on an Amazon EC2 Medium Instance (Ubuntu image, 3.7 GiB of
> > memory, 2 cores). I am only parsing HTML content (using the HTML
> > parsers). Around *100k* URLs have been fetched, and the content size is
> > *3 GiB*. I have been experimenting to get a benchmark on the crawler
> > (the fetch/parse cycle) and have crawled a few data sets. According to
> > my observations, parsing the HTML documents of those 100k URLs (roughly
> > 3 GiB) took around 32 to 48 hours via ParseSegment. Throughout the
> > parsing process, CPU and memory utilization is almost 100%.
> >
> > Has anyone got a performance benchmark for ParseSegment? Does 48 hours
> > to parse 3 GiB of content from 100k URLs sound reasonable? If that is
> > far off, is there any direction I should explore to gain performance,
> > such as tweaking the ParseSegment flow?
> >
> > Apologies for the many questions. Thanks in advance.
> >
> > Cheers,
> >
> > Ye
> >
>
>
>
> --
> Kiran Chitturi
>
