I just started learning about Nutch and Solr, and I am getting
confused about a few things.
I am using Cygwin on Windows XP.
Basically I crawl with this command:
sh nutch crawl urls -dir ../data/jf -topN 1000
So basically this means that each segment will contain 1000 URLs, right?
I went to the jf folder and saw that there are 2 folders under segments,
with timestamps as names.
So theoretically I should have 2000 documents, right? Or wrong?
Then I indexed it into Solr with solrindex.
The catch-all query *:* returns a "numFound" of 77.
Some of the URLs that I assumed were crawled are not in the results.
Can anyone point me in the right direction?
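
For anyone trying to reproduce this, here is a rough sketch of how the three counts (URLs known to the crawl, URLs per segment, documents in Solr) could be compared. It assumes the script is run from the same directory as the crawl command above, that Solr is listening at http://localhost:8983/solr (an assumption, not something stated above), and that curl is installed. The CRAWL_DIR and SOLR_URL variables and the solr_count_url helper are names made up for this sketch.

```shell
#!/bin/sh
# Assumed locations; adjust to the actual install.
CRAWL_DIR="${CRAWL_DIR:-../data/jf}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"   # assumption: default Solr port and path

# Hypothetical helper: builds a match-all query URL that returns
# only the hit count (rows=0), not the documents themselves.
solr_count_url() {
  printf '%s/select?q=*:*&rows=0&wt=json\n' "$1"
}

if [ -f ./nutch ]; then
  # Status breakdown of every URL the crawl knows about (fetched, unfetched, ...).
  sh nutch readdb "$CRAWL_DIR/crawldb" -stats
  # Per-segment counts of generated, fetched and parsed URLs.
  sh nutch readseg -list -dir "$CRAWL_DIR/segments"
fi

if command -v curl >/dev/null 2>&1; then
  # "numFound" in the response is the number of documents in the index.
  curl -s --max-time 5 "$(solr_count_url "$SOLR_URL")"
fi
```

If the fetched count reported by readdb is close to 77, the gap would be between what -topN allows and what was actually fetched, rather than between the segments and Solr.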