Hi,

First of all, it may depend on the number of urls you are injecting (the number of seed urls in your urls directory; ../data/jf is the crawl output directory, not the seed list). If that is less than 1000, the first segment will be smaller than topN, and depending on the number of outlinks found, the second segment might be as well.

It can also depend on the maximum number of urls per host or domain you're fetching, controlled by the "generate.max.count" property (although I believe there is no restriction by default). If this were set to 100 and you had only one domain in your seed list, you might end up with just 200 fetched urls across your two segments.
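For reference, a minimal sketch of how that property can be overridden in conf/nutch-site.xml (the value 100 and the "domain" mode are only examples; the default of -1 means no limit):

    <!-- example values only; default generate.max.count is -1 (no limit) -->
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>
    <!-- count per "host" (default), "domain" or "ip" -->
    <property>
      <name>generate.count.mode</name>
      <value>domain</value>
    </property>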
It can also depend on the fetch results: if you select 1000 urls per segment (topN = 1000) but only 77 of them were fetched successfully, only those 77 can end up in the index. It may also depend on duplicate urls being removed along the way.

Please take a look at your crawldb for more details using the CrawlDbReader tool (example commands below the quoted message). And you might also want to look at the logs for clues.

Cheers,
Mathijs

On Nov 15, 2011, at 3:57, codegigabyte wrote:

> I just started learning about nutch and solr and I am starting to get confused
> over some issues.
>
> I am using cygwin on windows xp.
>
> Basically I crawl with this command:
>
> sh nutch crawl urls -dir ../data/jf -topN 1000
>
> So basically this means that each segment will contain 1000 urls, right?
>
> So I went to the jf folder and saw there are 2 folders under segments, with
> timestamps as names.
>
> So theoretically I should have 2000 documents, right? Or wrong?
>
> So I indexed it to solr with solrindex.
>
> The catch-all query *:* returns "numFound" of 77.
>
> Some of the urls I expected to be crawled were not found in the results.
>
> Can anyone point me in the right direction?
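As mentioned above, here is a sketch of a few CrawlDbReader invocations (this assumes your crawl directory is ../data/jf, so the crawldb lives at ../data/jf/crawldb):

    # overall status counts (db_fetched, db_unfetched, db_gone, ...)
    sh nutch readdb ../data/jf/crawldb -stats

    # dump the whole crawldb to text files for inspection
    sh nutch readdb ../data/jf/crawldb -dump crawldb-dump

    # show the record for a single url
    sh nutch readdb ../data/jf/crawldb -url http://www.example.com/

If -stats reports far fewer db_fetched entries than you expect, the missing urls were filtered, never selected by the generator, or failed to fetch.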

