Hi,

1. I am not sure, but most likely: our Nutch doesn't have the pluggable indexing backends, but I think it also uses IndexerMapReduce, IndexOutputFormat and IndexWriter.
2. Yes, it will use SolrWriter, and it reads its parameters from nutch-site, just as it normally would.
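As an aside on point 2, a minimal nutch-site.xml fragment would look roughly like the sketch below. The property names follow the usual Nutch 1.x conventions, but treat them as assumptions and verify against your own nutch-default.xml:

```xml
<!-- Hypothetical minimal fragment for conf/nutch-site.xml; verify the
     property names against nutch-default.xml in your Nutch 1.x tree. -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr</value>
  <description>Solr endpoint that SolrWriter posts documents to.</description>
</property>
<property>
  <name>solr.commit.size</name>
  <value>250</value>
  <description>Documents buffered before SolrWriter sends a batch to Solr.</description>
</property>
```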
Cheers

-----Original message-----
> From: S.L <[email protected]>
> Sent: Tuesday 17th December 2013 15:09
> To: [email protected]; [email protected]
> Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
>
> Markus,
>
> Thanks. All I am doing is taking the parsed text, applying scraping to the text
> in the Fetcher for every URL, and then indexing the data in Solr. It was working
> fine in local mode so far without any issues, but on Hadoop 2.2 I get OOM
> exceptions.
>
> Even though I am not entirely clear on the route suggested by you below, I
> would like to ask two questions before I embark on the suggested approach.
>
> 1. Can I use the plugin mechanism to make my code more modular and adapt to
> the MapReduce framework?
> 2. How can I use the Solr URL supplied in the command line parameters to
> achieve this with minimal changes to the Nutch code?
>
> Thanks and much appreciated!
>
> Sent from my HTC Inspire™ 4G on AT&T
>
> ----- Reply message -----
> From: "Markus Jelsma" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> Date: Tue, Dec 17, 2013 4:09 am
>
> In Fetcher's protected ParseStatus output(Text key, CrawlDatum datum,
> Content content, ProtocolStatus pstatus, int status, int outlinkDepth) you
> need to output a NutchIndexAction object for every add or delete you get. By
> comparing signatures you can skip not_modified pages. You also need to check
> for robots=noindex and 404 pages. You need to build up a NutchDocument, pass
> it through indexing filters etc., just like the current IndexerMapReduce. In
> FetcherOutputFormat you need to create an IndexerOutputFormat just like the
> ParseOutputFormat that is already there, and handle the incoming
> NutchIndexAction in RecordWriter.write().
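To make the shape of that change concrete, here is a self-contained sketch of the pattern described above: the fetcher-side code only *emits* an index action instead of talking to Solr, and the writer on the output side batches and flushes them. The types below (IndexAction, BatchingWriter) are hypothetical stand-ins for Nutch's NutchIndexAction and SolrWriter, not the real classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NutchIndexAction: a document URL plus an ADD/DELETE verb.
class IndexAction {
    enum Op { ADD, DELETE }
    final Op op;
    final String url;
    IndexAction(Op op, String url) { this.op = op; this.url = url; }
}

// Hypothetical stand-in for SolrWriter: buffers actions and flushes in batches,
// so one writer (and hence one HTTP client) serves every document in the task.
class BatchingWriter {
    private final List<IndexAction> buffer = new ArrayList<>();
    private final int batchSize;
    int commits = 0;                        // how many batch flushes happened

    BatchingWriter(int batchSize) { this.batchSize = batchSize; }

    void write(IndexAction action) {
        buffer.add(action);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        // Real code would send the buffer over a single shared SolrJ client here.
        buffer.clear();
        commits++;
    }
}

public class FetcherIndexSketch {
    // Fetcher-side: decide what to emit, but do NOT index here.
    static IndexAction output(String url, int httpStatus, boolean robotsNoindex) {
        if (httpStatus == 404) return new IndexAction(IndexAction.Op.DELETE, url);
        if (robotsNoindex) return null;     // skip entirely
        return new IndexAction(IndexAction.Op.ADD, url);
    }

    public static void main(String[] args) {
        BatchingWriter writer = new BatchingWriter(2);
        for (String url : new String[] {"http://a/", "http://b/", "http://c/"}) {
            IndexAction a = output(url, 200, false);
            if (a != null) writer.write(a);
        }
        writer.flush();                     // final flush, as RecordWriter.close() would
        System.out.println("commits=" + writer.commits); // prints commits=2
    }
}
```

The point of the split is that the expensive, stateful part (the client and its batching) lives in exactly one place per task, while the per-URL logic stays cheap and stateless.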
> > -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Monday 16th December 2013 18:00
> > To: [email protected]
> > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> >
> > Markus,
> >
> > Actually I started doing this after you replied to my queries a few months
> > ago. I did not face this issue when I was running Nutch in local mode;
> > it seems this issue shows up when running in deploy (single-node cluster)
> > mode.
> >
> > I will go ahead and change this to index the document in the
> > FetcherOutputFormat class (would you please tell me the line number to
> > insert the code at?).
> >
> > However, I was wondering if I would be able to leverage the plugin
> > mechanism to do this, and if there is any Solr plugin that takes the parsed
> > text from the URL and indexes it based on some transformation that I do?
> >
> > I really appreciate your help.
> >
> > Thanks.
> >
> > On Mon, Dec 16, 2013 at 10:08 AM, Markus Jelsma <[email protected]> wrote:
> >
> > > You have modified the Fetcher to index documents? In that case, you should
> > > index in the reducer (FetcherOutputFormat), not while mapping, and reuse
> > > the existing indexing code of SolrWriter. In any case, you should not
> > > create a client per document.
> > >
> > > -----Original message-----
> > > > From: S.L <[email protected]>
> > > > Sent: Monday 16th December 2013 15:57
> > > > To: [email protected]; [email protected]
> > > > Subject: Re: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > >
> > > > Markus,
> > > >
> > > > Yes, you are right, FetcherThread does not use SolrJ by itself; I am
> > > > adding a call to Solr to save the data. I am concerned about the number
> > > > of HttpClients being created. It seems it's creating a client per
> > > > document I am saving in Solr; this could be expected, but I just want
> > > > to confirm.
> > > >
> > > > Thanks.
> > > >
> > > > Sent from my HTC Inspire™ 4G on AT&T
> > > >
> > > > ----- Reply message -----
> > > > From: "Markus Jelsma" <[email protected]>
> > > > To: "[email protected]" <[email protected]>
> > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > Date: Mon, Dec 16, 2013 6:27 am
> > > >
> > > > Hi - How can this be in FetcherThread? Nutch does not use SolrJ in the
> > > > Fetcher. Do you have the entire Fetcher log?
> > > >
> > > > -----Original message-----
> > > > > From: S.L <[email protected]>
> > > > > Sent: Monday 16th December 2013 6:40
> > > > > To: [email protected]
> > > > > Subject: Excessive HttpClient creation (Nutch 1.7 on Hadoop 2.2)
> > > > >
> > > > > Hi Folks,
> > > > >
> > > > > I am running Nutch 1.7 on Hadoop 2.2, and in the Hadoop logs for
> > > > > FetcherThread I see the following statements, which tell me that the
> > > > > HttpClients are being created per URL; is this a correct assumption?
> > > > > Also, after a few fetches I notice that the Hadoop job throws an OOM
> > > > > error. Please advise.
> > > > >
> > > > > 2013-12-15 23:47:31,921 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:31,931 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:31,932 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,034 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,040 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,187 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,214 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,250 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > > > > 2013-12-15 23:47:32,264 INFO [FetcherThread] org.apache.solr.client.solrj.impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
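The fix suggested upthread — build the client once and reuse it for every document instead of creating one per document, as the log timestamps suggest is happening — can be illustrated with a self-contained sketch. It uses the plain JDK java.net.http.HttpClient (Java 11+) rather than SolrJ, and the counter class is invented purely for the demo:

```java
import java.net.http.HttpClient;
import java.util.concurrent.atomic.AtomicInteger;

// Counts HttpClient constructions so the two usage patterns can be compared.
class CountingClientFactory {
    static final AtomicInteger created = new AtomicInteger();
    static HttpClient newClient() {
        created.incrementAndGet();
        return HttpClient.newHttpClient(); // each instance owns its own connection pool
    }
}

public class SharedClientSketch {
    public static void main(String[] args) {
        int docs = 5;

        // Anti-pattern (what the log above suggests is happening):
        // a fresh client, and therefore a fresh connection pool, per document.
        for (int i = 0; i < docs; i++) {
            HttpClient perDoc = CountingClientFactory.newClient();
        }
        System.out.println("per-document: " + CountingClientFactory.created.get()); // prints per-document: 5

        CountingClientFactory.created.set(0);

        // Recommended: build once, reuse for every document in the task.
        HttpClient shared = CountingClientFactory.newClient();
        for (int i = 0; i < docs; i++) {
            // post(shared, doc) would go here; the same pool serves all requests
        }
        System.out.println("shared: " + CountingClientFactory.created.get()); // prints shared: 1
    }
}
```

Under load, the per-document pattern leaks connection pools and threads faster than they are reclaimed, which is consistent with the OOM reported above once a fetch runs long enough.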

