Hi - if you're not using NutchDocument and the other machinery from
IndexerMapReduce, but are just writing a simple document to Solr, you should
not need a Reducer or an IndexerOutputFormat. Perhaps the OOM occurs because
you continuously create new objects, hence the HttpClientUtil logging you
reported earlier. Those objects should be cleaned up, but preferably reused:
you should have only one client object per thread.
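
The reuse pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not Nutch code: `SolrClientHolder` and `FakeSolrServer` are hypothetical names, with `FakeSolrServer` standing in for SolrJ's real `org.apache.solr.client.solrj.impl.HttpSolrServer` so the sketch runs without a Solr instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SolrClientHolder {
    // Stand-in for SolrJ's HttpSolrServer so the sketch compiles and runs
    // without SolrJ on the classpath. It only counts how often it is built.
    static class FakeSolrServer {
        static final AtomicInteger CREATED = new AtomicInteger();
        FakeSolrServer(String url) { CREATED.incrementAndGet(); }
        void add(String doc) { /* would send the document to Solr */ }
    }

    // One client for the whole JVM, created lazily on first use.
    private static volatile FakeSolrServer INSTANCE;

    static FakeSolrServer get() {
        if (INSTANCE == null) {
            synchronized (SolrClientHolder.class) {
                if (INSTANCE == null) {
                    INSTANCE = new FakeSolrServer("http://localhost:8983/solr");
                }
            }
        }
        return INSTANCE;
    }

    public static void main(String[] args) {
        // Simulate many fetched documents: the client is built exactly once,
        // instead of once per document as in the logs below.
        for (int i = 0; i < 1000; i++) {
            get().add("doc-" + i);
        }
        System.out.println("clients created: " + FakeSolrServer.CREATED.get());
    }
}
```

With real SolrJ you would replace `FakeSolrServer` with `HttpSolrServer` (and ideally batch the adds and commit once), but the point is the same: construct the client once and reuse it, rather than per document.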

-----Original message-----
> From:S.L <[email protected]>
> Sent: Thursday 19th December 2013 16:53
> To: [email protected]
> Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> 
> Markus,
> 
> I am not very clear about the suggested route, due to my unfamiliarity with
> Nutch. All I am doing in the Fetcher is intercepting the URL, extracting
> precise data and populating Solr using a utility method from a third-party
> jar. What would I gain from doing this in the Reduce phase you suggest,
> instead of the Map phase I am doing it in now? I would really appreciate it
> if you could clear that up.
> 
> Also, after I started running Nutch 1.7 on Hadoop 2.2 I started getting OOM
> exceptions. I have already increased my HADOOP_HEAPSIZE to 4GB on an 8GB
> laptop; this never happened when I was running in local mode, where I was
> able to crawl large sets (a few 100Ks) easily. Can you please also let me
> know what the issue might be?
> 
> Thanks again and much appreciated !
> 
> 
> On Tue, Dec 17, 2013 at 4:09 AM, Markus Jelsma
> <[email protected]>wrote:
> 
> > In Fetcher's protected ParseStatus output(Text key, CrawlDatum datum,
> > Content content, ProtocolStatus pstatus, int status, int outlinkDepth) you
> > need to output a NutchIndexAction object for every add or delete you get.
> > By comparing signatures you can skip not_modified pages. You also need to
> > check for robots=noindex and 404 pages. You need to build up a
> > NutchDocument and pass it through the indexing filters etc., just like the
> > current IndexerMapReduce does. In FetcherOutputFormat you need to create
> > an IndexerOutputFormat just like the ParseOutputFormat that is already
> > there, and handle the incoming NutchIndexAction in RecordWriter.write().
> >
> > -----Original message-----
> > > From:S.L <[email protected]>
> > > Sent: Monday 16th December 2013 18:00
> > > To: [email protected]
> > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > >
> > > Markus,
> > >
> > > Actually I started doing this after you replied to my queries a few
> > > months ago. I did not face this issue when I was running Nutch in local
> > > mode; it seems this issue shows up when running in deploy (single-node
> > > cluster) mode.
> > >
> > > I will go ahead and change this to index the document in the
> > > FetcherOutputFormat class (would you please tell me the line number to
> > > insert the code at?).
> > >
> > > However, I was wondering if I would be able to leverage the plugin
> > > mechanism to do this, and if there is any Solr plugin that takes the
> > > parsed text from the URL and indexes it based on some transformation
> > > that I do?
> > >
> > > I really appreciate your help .
> > >
> > > Thanks.
> > >
> > >
> > > On Mon, Dec 16, 2013 at 10:08 AM, Markus Jelsma
> > > <[email protected]>wrote:
> > >
> > > > You have modified the Fetcher to index documents? In that case, you
> > > > should index in the reducer (FetcherOutputFormat), not while mapping,
> > > > and reuse the existing indexing code of SolrWriter. In any case, you
> > > > should not create a client per document.
> > > >
> > > > -----Original message-----
> > > > > From:S.L <[email protected]>
> > > > > Sent: Monday 16th December 2013 15:57
> > > > > To: [email protected]; [email protected]
> > > > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > >
> > > > > Markus,
> > > > >
> > > > >
> > > > > Yes, you are right, FetcherThread does not use SolrJ by itself; I am
> > > > > adding a call to Solr to save the data. I am concerned about the
> > > > > number of HttpClients being created. It seems it is creating a client
> > > > > per document I am saving in Solr; this could be expected, but I just
> > > > > want to confirm.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Sent from my HTC Inspire™ 4G on AT&T
> > > > >
> > > > > ----- Reply message -----
> > > > > From: "Markus Jelsma" <[email protected]>
> > > > > To: "[email protected]" <[email protected]>
> > > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > > Date: Mon, Dec 16, 2013 6:27 am
> > > > >
> > > > >
> > > > > Hi - How can this be in FetcherThread, Nutch does not use SolrJ in
> > > > Fetcher. Do you have the entire Fetcher log?
> > > > >
> > > > > -----Original message-----
> > > > > > From:S.L <[email protected]>
> > > > > > Sent: Monday 16th December 2013 6:40
> > > > > > To: [email protected]
> > > > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > > >
> > > > > > Hi Folks,
> > > > > >
> > > > > > I am running Nutch 1.7 on Hadoop 2.2, and in the Hadoop logs for
> > > > > > FetcherThread I see the following statements, which tell me that an
> > > > > > HttpClient is being created per URL; is this a correct assumption?
> > > > > > Also, after a few fetches I notice that the Hadoop job throws an
> > > > > > OOM error. Please advise.
> > > > > >
> > > > > > 2013-12-15 23:47:31,921 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:31,931 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:31,932 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,040 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,187 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,214 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,250 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > > 2013-12-15 23:47:32,264 INFO [FetcherThread]
> > > > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > > > client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > >
> > > > >
> > > >
> > >
> >
> 
