Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)

S.L Fri, 20 Dec 2013 08:32:18 -0800

Markus,

The OOM was occurring because I was using 30 threads in my config file.
When I reduced my thread count to 10 it stopped occurring. I am
assuming this was a resource issue, as my current machine probably cannot
handle more than a certain number of threads per Map task.
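
For anyone reading this thread later, here is a minimal sketch of the
"one object per thread" advice quoted below. It is not Nutch or SolrJ code:
IndexClient is a hypothetical stand-in for whatever expensive client the
third-party jar creates, and run() is an illustrative helper. It assumes
Java 8 or later for ThreadLocal.withInitial.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadClientSketch {

    /** Hypothetical stand-in for an expensive client (not a real SolrJ class). */
    static final class IndexClient {
        static final AtomicInteger CREATED = new AtomicInteger();

        IndexClient() {
            CREATED.incrementAndGet(); // count how many clients get built
        }

        void add(String doc) {
            // A real client would send the document to the index here.
        }
    }

    // One client per thread, created lazily on first use and reused afterwards.
    private static final ThreadLocal<IndexClient> CLIENT =
            ThreadLocal.withInitial(IndexClient::new);

    /** Runs one task per thread and returns how many clients were created. */
    static int run(int threads, int docsPerThread) throws InterruptedException {
        IndexClient.CREATED.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    // Reuses this thread's client instead of building a new one per document.
                    CLIENT.get().add("doc-" + i);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return IndexClient.CREATED.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("clients created: " + run(10, 1000));
    }
}
```

In this sketch the number of clients tracks the number of threads, not the
number of documents, so memory use no longer grows with the crawl size.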


Also, would you happen to know how many MapReduce tasks are created by
Hadoop in pseudo-distributed mode by default?

Thanks.


On Fri, Dec 20, 2013 at 7:51 AM, Markus Jelsma
<[email protected]> wrote:

> Hi - if you're not using NutchDocument and all the other interesting stuff
> from IndexerMapReduce, but just write a simple document to Solr,
> you should not need to use Reducer and IndexerOutputFormat. Perhaps the OOM
> occurs because you continuously create new objects, hence the logging of
> HttpClientUtil you reported earlier. Those objects should either be cleaned
> up or, preferably, reused. You should have only one object per thread.
>
> -----Original message-----
> > From:S.L <[email protected]>
> > Sent: Thursday 19th December 2013 16:53
> > To: [email protected]
> > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> >
> > Markus,
> >
> > I am not very clear about the suggested route due to my unfamiliarity
> > with Nutch. All I am doing in the Fetcher is intercepting the URL, extracting
> > precise data and populating Solr using a utility method from a third-party
> > jar. What would I gain from doing this in the Reduce phase you suggested
> > instead of the Map phase I am apparently doing it in? I would really
> > appreciate it if you could clarify that.
> >
> > Also, after I started running Nutch 1.7 on Hadoop 2.2 I started getting
> > OOM exceptions. I have already increased my HADOOP_HEAPSIZE to 4GB on an
> > 8GB laptop. This never happened when I was running in local mode, and I
> > was able to do large crawls (a few 100Ks) easily. Can you please also let
> > me know what the issue might be?
> >
> > Thanks again, much appreciated!
> >
> >
> > On Tue, Dec 17, 2013 at 4:09 AM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > > In Fetcher's protected ParseStatus output(Text key, CrawlDatum datum,
> > > Content content, ProtocolStatus pstatus, int status, int outlinkDepth)
> > > you need to output a NutchIndexAction object for every add or delete you
> > > get. By comparing signatures you can skip not_modified pages. You also
> > > need to check for robots=noindex and 404 pages. You need to build up a
> > > NutchDocument, pass it through indexing filters etc., just like the
> > > current IndexerMapReduce. In FetcherOutputFormat you need to create an
> > > IndexerOutputFormat just like the ParseOutputFormat that is already
> > > there, and handle the incoming NutchIndexAction in RecordWriter.write().
> > >
> > > -----Original message-----
> > > > From:S.L <[email protected]>
> > > > Sent: Monday 16th December 2013 18:00
> > > > To: [email protected]
> > > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > >
> > > > Markus,
> > > >
> > > > Actually I started doing this after you replied to my queries a few
> > > > months ago. I did not face this issue when I was running Nutch in local
> > > > mode; it seems this issue shows up when running in deploy (single-node
> > > > cluster) mode.
> > > >
> > > > I will go ahead and change this to index the document in the
> > > > FetcherOutputFormat class (would you please tell me the line number
> > > > to insert the code at?).
> > > >
> > > > However, I was wondering if I would be able to leverage the plugin
> > > > mechanism to do this, and if there is any Solr plugin that takes the
> > > > parsed text from the URL and indexes it based on some transformation
> > > > that I do?
> > > >
> > > > I really appreciate your help.
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > On Mon, Dec 16, 2013 at 10:08 AM, Markus Jelsma
> > > > <[email protected]> wrote:
> > > >
> > > > > You have modified the Fetcher to index documents? In that case, you
> > > > > should index in the reducer (FetcherOutputFormat), not while mapping,
> > > > > and reuse the existing indexing code of SolrWriter. In any case, you
> > > > > should not create a client per document.
> > > > >
> > > > > -----Original message-----
> > > > > > From:S.L <[email protected]>
> > > > > > Sent: Monday 16th December 2013 15:57
> > > > > > To: [email protected]; [email protected]
> > > > > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > > >
> > > > > > Markus,
> > > > > >
> > > > > >
> > > > > > Yes, you are right: FetcherThread does not use SolrJ by itself; I am
> > > > > > adding a call to Solr to save the data. I am concerned about the
> > > > > > number of HttpClients being created. It seems it is creating a client
> > > > > > per document I am saving in Solr. This could be expected, but I just
> > > > > > want to confirm.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Sent from my HTC Inspire™ 4G on AT&T
> > > > > >
> > > > > > ----- Reply message -----
> > > > > > From: "Markus Jelsma" <[email protected]>
> > > > > > To: "[email protected]" <[email protected]>
> > > > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > > > Date: Mon, Dec 16, 2013 6:27 am
> > > > > >
> > > > > >
> > > > > > Hi - How can this be in FetcherThread? Nutch does not use SolrJ in
> > > > > > Fetcher. Do you have the entire Fetcher log?
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:S.L <[email protected]>
> > > > > > > Sent: Monday 16th December 2013 6:40
> > > > > > > To: [email protected]
> > > > > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > > > >
> > > > > > > Hi Folks,
> > > > > > >
> > > > > > > I am running Nutch 1.7 on Hadoop 2.2 and in the Hadoop logs for
> > > > > > > FetcherThread I see the following statements, which tell me that
> > > > > > > the HttpClients are being created per URL. Is this a correct
> > > > > > > assumption? Also, after a few fetches I notice that the Hadoop job
> > > > > > > throws an OOM error. Please advise.
> > > > > > >
> > > > > > > 2013-12-15 23:47:31,921 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:31,931 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:31,932 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,040 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,187 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,214 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,250 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > > 2013-12-15 23:47:32,264 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
