Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)

S.L Tue, 17 Dec 2013 06:12:45 -0800

Markus

Thanks. All I am doing is taking the parsed text, scraping it in the Fetcher 
for every URL, and then indexing the data in Solr. This worked fine in local 
mode without any issues, but on Hadoop 2.2 I get OOM exceptions.


Even though I am not entirely clear on the route you suggest below, I would 
like to ask two questions before I embark on that approach.

1. Can I use the plugin mechanism to make my code more modular and adapt it to 
the MapReduce framework?
2. How can I use the Solr URL supplied in the command line parameters to achieve 
this with minimal changes to the Nutch code?

Thanks and much appreciated!


Sent from my HTC Inspire™ 4G on AT&T

----- Reply message -----
From: "Markus Jelsma" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
Date: Tue, Dec 17, 2013 4:09 am


In Fetcher's protected ParseStatus output(Text key, CrawlDatum datum, Content 
content, ProtocolStatus pstatus, int status, int outlinkDepth) you need to 
output a NutchIndexAction object for every add or delete you get. By comparing 
signatures you can skip not_modified pages. You also need to check for 
robots=noindex and 404 pages. You need to build up a NutchDocument and pass it 
through the indexing filters etc., just like the current IndexerMapReduce does. 
In FetcherOutputFormat you need to create an IndexerOutputFormat, just like the 
ParseOutputFormat that is already there, and handle the incoming 
NutchIndexAction in RecordWriter.write().
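
A minimal sketch of the flow described above, in plain Java with stand-in 
types so that it runs without Nutch, Hadoop or SolrJ on the classpath. 
Action, Client, Writer and decide() are hypothetical names, not Nutch or 
SolrJ classes. Turning 404 and robots=noindex pages into deletes, and the 
batching in write(), are assumptions made for illustration, not something 
the message above prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only: none of these types are Nutch or SolrJ classes.
final class FetchIndexSketch {

    enum Kind { ADD, DELETE }

    // Hypothetical counterpart of an index action: one add or delete per URL.
    static final class Action {
        final Kind kind;
        final String url;
        final String text; // null for deletes

        Action(Kind kind, String url, String text) {
            this.kind = kind;
            this.url = url;
            this.text = text;
        }
    }

    // Hypothetical index client; a real job would wrap one Solr client here.
    interface Client {
        void add(List<Action> batch);
        void delete(String url);
        void commit();
    }

    // Fetch side: decide what to emit for a page. Returns null to emit nothing.
    // Assumption: 404 and robots=noindex pages are turned into deletes.
    static Action decide(String url, int httpStatus, boolean noIndex,
                         byte[] oldSignature, byte[] newSignature, String text) {
        if (httpStatus == 404 || noIndex) {
            return new Action(Kind.DELETE, url, null);
        }
        if (oldSignature != null && Arrays.equals(oldSignature, newSignature)) {
            return null; // unchanged since the last fetch, so skip it
        }
        return new Action(Kind.ADD, url, text);
    }

    // Output side: one writer and one client per task, never one per document.
    static final class Writer implements AutoCloseable {
        private final Client client;
        private final int batchSize;
        private final List<Action> pending = new ArrayList<>();

        Writer(Client client, int batchSize) {
            this.client = client;
            this.batchSize = batchSize;
        }

        void write(Action action) {
            if (action.kind == Kind.DELETE) {
                flush(); // send buffered adds first to keep arrival order
                client.delete(action.url);
            } else {
                pending.add(action);
                if (pending.size() >= batchSize) {
                    flush();
                }
            }
        }

        private void flush() {
            if (!pending.isEmpty()) {
                client.add(new ArrayList<>(pending));
                pending.clear();
            }
        }

        @Override
        public void close() {
            flush();
            client.commit();
        }
    }
}
```

The point of the design is that the client is created once, handed to the 
writer, and reused for every document until close(), which is what avoids a 
new HTTP client per URL. In a real job the Client role would be played by 
the existing Solr writing code rather than by a new class.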
 
-----Original message-----
> From:S.L <[email protected]>
> Sent: Monday 16th December 2013 18:00
> To: [email protected]
> Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> 
> Markus,
> 
> Actually, I started doing this after you replied to my queries a few months
> ago. I did not face this issue when I was running Nutch in local mode; it
> seems to show up only when running in deploy mode (single-node
> cluster).
> 
> I will go ahead and change this to index the document in the
> FetcherOutputFormat class (would you please tell me the line number to
> insert the code at?).
> 
> However, I was wondering whether I would be able to leverage the plugin
> mechanism to do this, and whether there is any Solr plugin that takes the
> parsed text from the URL and indexes it based on some transformation that I do?
> 
> I really appreciate your help.
> 
> Thanks.
> 
> 
> On Mon, Dec 16, 2013 at 10:08 AM, Markus Jelsma
> <[email protected]> wrote:
> 
> > You have modified the Fetcher to index documents? In that case, you should
> > index in the reducer (FetcherOutputFormat), not while mapping, and reuse
> > the existing indexing code of SolrWriter. In any case, you should not
> > create a client per document.
> >
> > -----Original message-----
> > > From:S.L <[email protected]>
> > > Sent: Monday 16th December 2013 15:57
> > > To: [email protected]; [email protected]
> > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > >
> > > Markus,
> > >
> > >
> > > Yes, you are right, FetcherThread does not use SolrJ by itself; I am
> > > adding a call to Solr to save the data. I am concerned about the number of
> > > HttpClients being created. It seems it is creating a client per document I
> > > am saving in Solr. This could be expected, but I just want to confirm.
> > >
> > > Thanks.
> > >
> > > Sent from my HTC Inspire™ 4G on AT&T
> > >
> > > ----- Reply message -----
> > > From: "Markus Jelsma" <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > Date: Mon, Dec 16, 2013 6:27 am
> > >
> > >
> > > Hi - How can this be in FetcherThread? Nutch does not use SolrJ in
> > > Fetcher. Do you have the entire Fetcher log?
> > >
> > > -----Original message-----
> > > > From:S.L <[email protected]>
> > > > Sent: Monday 16th December 2013 6:40
> > > > To: [email protected]
> > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > >
> > > > Hi Folks,
> > > >
> > > > I am running Nutch 1.7 on Hadoop 2.2, and in the Hadoop logs for
> > > > FetcherThread I see the following statements, which tell me that
> > > > HttpClients are being created per URL. Is this a correct assumption? Also,
> > > > after a few fetches I notice that the Hadoop job throws an OOM error.
> > > > Please advise.
> > > >
> > > > 2013-12-15 23:47:31,921 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:31,931 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:31,932 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,040 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,187 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,214 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,250 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > 2013-12-15 23:47:32,264 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > >
> > >
> >
>
