Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Amit Sela Fri, 01 Mar 2013 16:02:27 -0800

I am using the crawl script that executes Solr indexing with:
  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb \
    -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
and then executes Solr dedup:
  $bin/nutch solrdedup $SOLRURL


I think it has something to do with the CrawlDB job. The job counters show:
db_redir_temp 4,770
db_redir_perm 56,810
db_notmodified 5,343
db_unfetched 27,385
db_gone 3,741
db_fetched 22,065
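
For reference, here is a quick tally of the counters above, as a minimal Python sketch. The numbers are copied from the job output; the variable and function names are only illustrative. The statuses sum to 120,114, which roughly matches the ~120K seed list, so each seed URL appears to end up in exactly one status. If only db_fetched entries reach the indexer (an assumption, not something confirmed in this thread), that would be in line with the ~20K documents added.

```python
# CrawlDB status counters copied from the job output above.
counters = {
    "db_redir_temp": 4770,
    "db_redir_perm": 56810,
    "db_notmodified": 5343,
    "db_unfetched": 27385,
    "db_gone": 3741,
    "db_fetched": 22065,
}

# Total number of CrawlDB entries across all statuses.
total = sum(counters.values())


def share(status):
    """Return the percentage of CrawlDB entries that have the given status."""
    return 100.0 * counters[status] / total


# Print each status with its count and share, largest first.
for name, count in sorted(counters.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {count:7d} {share(name):5.1f}%")
print(f"{'total':15s} {total:7d}")
```

Roughly half of the entries are redirects (db_redir_perm plus db_redir_temp), which are recorded in the CrawlDB under the original seed URL.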


On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi <[email protected]> wrote:

> This looks odd. As far as I know, the successfully parsed documents are
> sent to Solr. Did you check the logs for any exceptions?
>
> What command are you using to index?
>
>
> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I'm running Nutch 1.6 and Solr 3.6.2.
> > I'm trying to crawl only the seed list (depth 1) and it seems that the
> > process ends with only ~25% of the URLs indexed in Solr.
> >
> > Seed list is about 120K.
> > Fetcher map input is 117K where success is 62K and temp_moved 45K.
> > Parse shows success of 62K.
> > CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
> > and db_fetched=22K.
> >
> > And finally IndexerStatus shows 20K documents added.
> > What am I missing?
> >
> > Thanks!
> >
> > my nutch-site.xml includes:
> > -----------------------------------------
> > <name>plugin.includes</name>
> > <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
> > <name>metatags.names</name>
> > <value>keywords;Keywords;description;Description</value>
> > <name>index.parse.md</name>
> > <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
> > <name>db.update.additions.allowed</name>
> > <value>false</value>
> > <name>generate.count.mode</name>
> > <value>domain</value>
> > <name>partition.url.mode</name>
> > <value>byDomain</value>
> > <name>file.content.limit</name>
> > <value>262144</value>
> > <name>http.content.limit</name>
> > <value>262144</value>
> > <name>parse.filter.urls</name>
> > <value>true</value>
> > <name>parse.normalize.urls</name>
> > <value>true</value>
> >
>
>
>
> --
> Kiran Chitturi
>
