Hi Ye,

Are you running Nutch in local mode?

48 hours to parse 100k URLs works out to roughly 1.7 s per page. I think
most of the time in ParseSegment is spent parsing the HTML and writing the
parse data to DFS: text to parse_text, metadata to parse_data, and
outlinks to crawl_parse.
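
For reference, a parsed Nutch 1.x segment ends up with these
subdirectories (a sketch; the segment name below is hypothetical):

  crawl/segments/20130309123456/
    content         raw fetched content
    crawl_generate  the fetch list the segment was generated from
    crawl_fetch     fetch status per URL
    crawl_parse     outlinks and scores, used to update the crawldb
    parse_data      parse metadata (title, outlink details, ...)
    parse_text      the extracted plain text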

Maybe you can run Nutch on a cluster in deploy mode. That would make full
use of MapReduce's distributed computing capabilities.
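
A rough sketch of what that looks like, assuming a working Hadoop cluster
and a hypothetical segment path:

  # build the Nutch job jar from the source checkout
  cd $NUTCH_HOME
  ant job

  # run ParseSegment as a MapReduce job, so the parse maps are spread
  # across the cluster's task trackers
  bin/hadoop jar build/nutch-1.2.job \
      org.apache.nutch.parse.ParseSegment \
      crawl/segments/20130309123456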




On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet <[email protected]> wrote:

> Hi,
>
> It is *NOT* about the fetch. It is about the parse.
>
> I am using Nutch 1.2 with 500 hosts. The fetcher is doing OK with the
> default settings: 1 thread per host and 100 concurrent threads.
>
> The bottleneck is at the parse.
>
> Regards,
>
> On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi
> <[email protected]> wrote:
>
> > Hi!
> >
> > Which version of Nutch are you using ?
> >
> > Can you please give some more details on your configuration, like:
> >
> > i) How many threads are you using per queue?
> > ii) Do the URLs you are crawling belong to a single host or to
> > different hosts?
> >
> > AFAIK there are a lot of factors that come into play when trying to
> > find optimal (or at least decent) performance.
> >
> > If you are using one thread per queue and crawling a single host, it
> > is going to take quite some time because of the politeness policy: by
> > default there is a five-second gap between requests to the same host.
> >
> > These configurations can be changed, and there has been discussion
> > about them on the mailing list.
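> >
> > As a minimal sketch, the relevant knobs live in conf/nutch-site.xml
> > (property names are the Nutch 1.x fetcher settings; the values below
> > are only illustrative):
> >
> >   <property>
> >     <name>fetcher.threads.fetch</name>
> >     <value>100</value>  <!-- total concurrent fetch threads -->
> >   </property>
> >   <property>
> >     <name>fetcher.threads.per.host</name>
> >     <value>1</value>  <!-- politeness: simultaneous fetches per host -->
> >   </property>
> >   <property>
> >     <name>fetcher.server.delay</name>
> >     <value>5.0</value>  <!-- seconds between requests to one host -->
> >   </property>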
> >
> > I had a similar problem where the fetcher was taking very long with a
> > big topN, and I attained stability by changing the configuration.
> > Please find the discussion at [1].
> >
> >
> > [1]
> > http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
> >
> > Hope this helps.
> >
> > On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]> wrote:
> >
> > > Good day folks,
> > >
> > > I have a question about benchmarking the ParseSegment step of Nutch.
> > >
> > > I separated the fetch and parse processes: the URLs are fetched
> > > first, then the fetched content for the current depth is parsed. I
> > > believe that is also the recommended approach for performance.
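> > >
> > > Concretely, the split I mean is roughly this (the segment path below
> > > is just an example):
> > >
> > >   # fetch without parsing
> > >   bin/nutch fetch crawl/segments/20130309123456 -noParsing
> > >
> > >   # then parse the fetched content as a separate step
> > >   bin/nutch parse crawl/segments/20130309123456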
> > >
> > > I am running on an Amazon EC2 Medium instance (Ubuntu image, 3.7 GiB
> > > of RAM, 2 cores). I am only parsing the HTML contents (using the HTML
> > > parsers). The fetched URL count is around *100k* and the content size
> > > is *3 GiB*. I have been experimenting to get a benchmark on the
> > > crawlers (the fetch/parse cycle) and have crawled a few data sets.
> > > From my observations, parsing the HTML documents of the 100k URLs
> > > (roughly 3 GiB) took around 32 to 48 hours via ParseSegment.
> > > Throughout the parsing process, CPU and memory utilization is almost
> > > 100%.
> > >
> > > Has anyone got a performance benchmark for ParseSegment? Does 48
> > > hours to parse 3 GiB of content from 100k URLs sound reasonable? If
> > > it is way off, is there any direction I should explore to gain
> > > performance, like tweaking the ParseSegment flow?
> > >
> > > Apologies for the many questions. Thanks in advance.
> > >
> > > Cheers,
> > >
> > > Ye
> > >
> >
> >
> >
> > --
> > Kiran Chitturi
> >
>



-- 
Don't Grow Old, Grow Up... :-)
