Hi,

1. I am not sure, but most likely: our Nutch doesn't have the pluggable indexing backends, but I think it also uses IndexerMapReduce, IndexOutputFormat and IndexWriter.
2. Yes, it will use SolrWriter, and it reads its parameters from nutch-site, just as it normally would.
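As an aside on point 2, a minimal nutch-site.xml fragment would look roughly like the sketch below. The property names follow the usual Nutch 1.x conventions, but treat them as assumptions and verify against your own nutch-default.xml:

```xml
<!-- Hypothetical minimal fragment for conf/nutch-site.xml; verify the
     property names against nutch-default.xml in your Nutch 1.x tree. -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr</value>
  <description>Solr endpoint that SolrWriter posts documents to.</description>
</property>
<property>
  <name>solr.commit.size</name>
  <value>250</value>
  <description>Documents buffered before SolrWriter sends a batch to Solr.</description>
</property>
```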
Cheers

-----Original message-----
> From: S.L <[email protected]>
> Sent: Tuesday 17th December 2013 15:09
> To: [email protected]; [email protected]
> Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
>
> Markus,
>
> Thanks. All I am doing is taking the parsed text, applying scraping to the text
> in the Fetcher for every URL, and then indexing the data in Solr. It was working
> fine in local mode so far without any issues, but on Hadoop 2.2 I get OOM
> exceptions.
>
> Even though I am not entirely clear on the route suggested by you below, I
> would like to ask two questions before I embark on the suggested approach.
>
> 1. Can I use the plugin mechanism to make my code more modular and adapt to
> the MapReduce framework?
> 2. How can I use the Solr URL supplied in the command line parameters to
> achieve this with minimal changes to the Nutch code?
>
> Thanks and much appreciated!
>
> Sent from my HTC Inspire™ 4G on AT&T
>
> ----- Reply message -----
> From: "Markus Jelsma" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> Date: Tue, Dec 17, 2013 4:09 am
>
> In Fetcher's protected ParseStatus output(Text key, CrawlDatum datum,
> Content content, ProtocolStatus pstatus, int status, int outlinkDepth) you
> need to output a NutchIndexAction object for every add or delete you get. By
> comparing signatures you can skip not_modified pages. You also need to check
> for robots=noindex and 404 pages. You need to build up a NutchDocument, pass
> it through indexing filters etc., just like the current IndexerMapReduce. In
> FetcherOutputFormat you need to create an IndexerOutputFormat just like the
> ParseOutputFormat that is already there, and handle the incoming
> NutchIndexAction in RecordWriter.write().
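To make the shape of that change concrete, here is a self-contained sketch of the pattern described above: the fetcher-side code only *emits* an index action instead of talking to Solr, and the writer on the output side batches and flushes them. The types below (IndexAction, BatchingWriter) are hypothetical stand-ins for Nutch's NutchIndexAction and SolrWriter, not the real classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NutchIndexAction: a document URL plus an ADD/DELETE verb.
class IndexAction {
    enum Op { ADD, DELETE }
    final Op op;
    final String url;
    IndexAction(Op op, String url) { this.op = op; this.url = url; }
}

// Hypothetical stand-in for SolrWriter: buffers actions and flushes in batches,
// so one writer (and hence one HTTP client) serves every document in the task.
class BatchingWriter {
    private final List<IndexAction> buffer = new ArrayList<>();
    private final int batchSize;
    int commits = 0;                        // how many batch flushes happened

    BatchingWriter(int batchSize) { this.batchSize = batchSize; }

    void write(IndexAction action) {
        buffer.add(action);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        // Real code would send the buffer over a single shared SolrJ client here.
        buffer.clear();
        commits++;
    }
}

public class FetcherIndexSketch {
    // Fetcher-side: decide what to emit, but do NOT index here.
    static IndexAction output(String url, int httpStatus, boolean robotsNoindex) {
        if (httpStatus == 404) return new IndexAction(IndexAction.Op.DELETE, url);
        if (robotsNoindex) return null;     // skip entirely
        return new IndexAction(IndexAction.Op.ADD, url);
    }

    public static void main(String[] args) {
        BatchingWriter writer = new BatchingWriter(2);
        for (String url : new String[] {"http://a/", "http://b/", "http://c/"}) {
            IndexAction a = output(url, 200, false);
            if (a != null) writer.write(a);
        }
        writer.flush();                     // final flush, as RecordWriter.close() would
        System.out.println("commits=" + writer.commits); // prints commits=2
    }
}
```

The point of the split is that the expensive, stateful part (the client and its batching) lives in exactly one place per task, while the per-URL logic stays cheap and stateless.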
> > -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Monday 16th December 2013 18:00
> > To: [email protected]
> > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> >
> > Markus,
> >
> > Actually I started doing this after you replied to my queries a few months
> > ago. I did not face this issue when I was running Nutch in local mode;
> > it seems this issue shows up when running in deploy (single-node cluster)
> > mode.
> >
> > I will go ahead and change this to index the document in the
> > FetcherOutputFormat class (would you please tell me the line number to
> > insert the code at?).
> >
> > However, I was wondering if I would be able to leverage the plugin
> > mechanism to do this, and if there is any Solr plugin that takes the parsed
> > text from the URL and indexes it based on some transformation that I do?
> >
> > I really appreciate your help.
> >
> > Thanks.
> >
> > On Mon, Dec 16, 2013 at 10:08 AM, Markus Jelsma <[email protected]> wrote:
> >
> > > You have modified the Fetcher to index documents? In that case, you should
> > > index in the reducer (FetcherOutputFormat), not while mapping, and reuse
> > > the existing indexing code of SolrWriter. In any case, you should not
> > > create a client per document.
> > >
> > > -----Original message-----
> > > > From: S.L <[email protected]>
> > > > Sent: Monday 16th December 2013 15:57
> > > > To: [email protected]; [email protected]
> > > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > >
> > > > Markus,
> > > >
> > > > Yes, you are right, FetcherThread does not use SolrJ by itself; I am
> > > > adding a call to Solr to save the data. I am concerned about the number
> > > > of HttpClients being created. It seems it's creating a client per
> > > > document I am saving in Solr; this could be expected, but I just want
> > > > to confirm.
> > > >
> > > > Thanks.
> > > >
> > > > Sent from my HTC Inspire™ 4G on AT&T
> > > >
> > > > ----- Reply message -----
> > > > From: "Markus Jelsma" <[email protected]>
> > > > To: "[email protected]" <[email protected]>
> > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > Date: Mon, Dec 16, 2013 6:27 am
> > > >
> > > > Hi - How can this be in FetcherThread? Nutch does not use SolrJ in the
> > > > Fetcher. Do you have the entire Fetcher log?
> > > >
> > > > -----Original message-----
> > > > > From: S.L <[email protected]>
> > > > > Sent: Monday 16th December 2013 6:40
> > > > > To: [email protected]
> > > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > >
> > > > > Hi Folks,
> > > > >
> > > > > I am running Nutch 1.7 on Hadoop 2.2, and in the Hadoop logs for
> > > > > FetcherThread I see the following statements, which tell me that the
> > > > > HttpClients are being created per URL; is this a correct assumption?
> > > > > Also, after a few fetches I notice that the Hadoop job throws an OOM
> > > > > error. Please advise.
> > > > >
> > > > > 2013-12-15 23:47:31,921 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:31,931 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:31,932 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,040 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,187 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,214 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,250 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,264 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
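The fix suggested upthread — build the client once and reuse it for every document instead of creating one per document, as the log timestamps suggest is happening — can be illustrated with a self-contained sketch. It uses the plain JDK java.net.http.HttpClient (Java 11+) rather than SolrJ, and the counter class is invented purely for the demo:

```java
import java.net.http.HttpClient;
import java.util.concurrent.atomic.AtomicInteger;

// Counts HttpClient constructions so the two usage patterns can be compared.
class CountingClientFactory {
    static final AtomicInteger created = new AtomicInteger();
    static HttpClient newClient() {
        created.incrementAndGet();
        return HttpClient.newHttpClient(); // each instance owns its own connection pool
    }
}

public class SharedClientSketch {
    public static void main(String[] args) {
        int docs = 5;

        // Anti-pattern (what the log above suggests is happening):
        // a fresh client, and therefore a fresh connection pool, per document.
        for (int i = 0; i < docs; i++) {
            HttpClient perDoc = CountingClientFactory.newClient();
        }
        System.out.println("per-document: " + CountingClientFactory.created.get()); // prints per-document: 5

        CountingClientFactory.created.set(0);

        // Recommended: build once, reuse for every document in the task.
        HttpClient shared = CountingClientFactory.newClient();
        for (int i = 0; i < docs; i++) {
            // post(shared, doc) would go here; the same pool serves all requests
        }
        System.out.println("shared: " + CountingClientFactory.created.get()); // prints shared: 1
    }
}
```

Under load, the per-document pattern leaks connection pools and threads faster than they are reclaimed, which is consistent with the OOM reported above once a fetch runs long enough.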

