In Fetcher's protected ParseStatus output(Text key, CrawlDatum datum, Content
content, ProtocolStatus pstatus, int status, int outlinkDepth) you need to
output a NutchIndexAction object for every add or delete you get. By comparing
signatures you can skip not-modified pages. You also need to check for
robots=noindex and 404 pages. You need to build up a NutchDocument and pass it
through the indexing filters etc., just like the current IndexerMapReduce does.
In FetcherOutputFormat you need to create an IndexerOutputFormat, just like the
ParseOutputFormat that is already there, and handle the incoming
NutchIndexAction in RecordWriter.write().
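A rough sketch of that RecordWriter pattern, with the key point being that one client is created per writer (i.e. per task), not per document. The class and field names below are stand-ins for illustration, not the real Nutch/SolrJ types:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexerRecordWriterSketch {

    // Stand-in for a NutchIndexAction: either an add (with a document id)
    // or a delete.
    static class IndexAction {
        static final int ADD = 0, DELETE = 1;
        final int op;
        final String docId;
        IndexAction(int op, String docId) { this.op = op; this.docId = docId; }
    }

    // Stand-in for SolrWriter: one shared client for the whole task,
    // reused across every write() call instead of being created per document.
    static class SharedIndexClient {
        int clientsCreated = 0;  // would be a single HttpClient in real SolrJ
        final List<String> added = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();
        SharedIndexClient() { clientsCreated++; }
        void add(String id)    { added.add(id); }
        void delete(String id) { deleted.add(id); }
    }

    // Created once when the RecordWriter is opened, closed when the task ends.
    final SharedIndexClient client = new SharedIndexClient();

    // Analogue of RecordWriter.write(): dispatch on the action type,
    // reusing the single client held by the writer.
    void write(IndexAction action) {
        if (action.op == IndexAction.ADD) client.add(action.docId);
        else client.delete(action.docId);
    }

    public static void main(String[] args) {
        IndexerRecordWriterSketch writer = new IndexerRecordWriterSketch();
        writer.write(new IndexAction(IndexAction.ADD, "http://example.com/a"));
        writer.write(new IndexAction(IndexAction.ADD, "http://example.com/b"));
        writer.write(new IndexAction(IndexAction.DELETE, "http://example.com/gone"));
        // One client regardless of how many documents were written.
        System.out.println(writer.client.clientsCreated);
    }
}
```

In the real code the client lives in the RecordWriter returned by IndexerOutputFormat.getRecordWriter() and is released in close(), which is what avoids the per-document HttpClient creation seen further down this thread.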
-----Original message-----
> From:S.L <[email protected]>
> Sent: Monday 16th December 2013 18:00
> To: [email protected]
> Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
>
> Markus,
>
> Actually, I started doing this after you replied to my queries a few months
> ago. I did not face this issue when running Nutch in local mode; it seems
> the issue only shows up when running in deploy (single-node cluster) mode.
>
> I will go ahead and change this to index the document in the
> FetcherOutputFormat class (would you please tell me the line number to
> insert the code at?).
>
> However, I was wondering if I would be able to leverage the plugin mechanism
> to do this, and if there is any Solr plugin that takes the parsed text from
> the URL and indexes it based on some transformation that I do?
>
> I really appreciate your help.
>
> Thanks.
>
>
> On Mon, Dec 16, 2013 at 10:08 AM, Markus Jelsma
> <[email protected]> wrote:
>
> > You have modified the Fetcher to index documents? In that case, you should
> > index in the reducer (FetcherOutputFormat), not while mapping, and reuse
> > the existing indexing code of SolrWriter. In any case, you should not
> > create a client per document.
> >
> > -----Original message-----
> > > From:S.L <[email protected]>
> > > Sent: Monday 16th December 2013 15:57
> > > To: [email protected]; [email protected]
> > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > >
> > > Markus,
> > >
> > >
> > > Yes, you are right, FetcherThread does not use SolrJ by itself; I am
> > adding a call to Solr to save the data. I am concerned about the number of
> > HttpClients being created. It seems it's creating a client per document I
> > am saving in Solr; this could be expected, but I just want to confirm.
> > >
> > > Thanks.
> > >
> > > Sent from my HTC Inspire™ 4G on AT&T
> > >
> > > ----- Reply message -----
> > > From: "Markus Jelsma" <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > Date: Mon, Dec 16, 2013 6:27 am
> > >
> > >
> > > Hi - How can this be in FetcherThread? Nutch does not use SolrJ in the
> > Fetcher. Do you have the entire Fetcher log?
> > >
> > > -----Original message-----
> > > > From:S.L <[email protected]>
> > > > Sent: Monday 16th December 2013 6:40
> > > > To: [email protected]
> > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > >
> > > > Hi Folks,
> > > >
> > > > I am running Nutch 1.7 on Hadoop 2.2, and in the Hadoop logs for
> > > > FetcherThread I see the following statements, which tells me that
> > > > HttpClients are being created per URL; is this a correct assumption?
> > > > Also, after a few fetches the Hadoop job throws an OOM error,
> > > > please advise.
> > > >
> > > > 2013-12-15 23:47:31,921 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:31,931 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:31,932 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,034 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,034 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,040 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,187 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,214 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,250 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,264 INFO [FetcherThread]
> > > > org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http
> > > > client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > >
> > >
> >
>