Hi,

Maybe you can use Luke (http://code.google.com/p/luke/) to inspect your indexes and check which URLs have been downloaded but not indexed. You can also go through your crawl log to see what happened to them. Keep in mind that Nutch deletes duplicated content when two pages have the same digest, so duplicates will not show up in the index.
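If it helps, here are a couple of example commands (the jar names and paths are just examples from a typical Nutch 1.x setup, so adjust them to your own install):

    # Browse the Lucene index interactively with Luke
    # (the jar name depends on which Luke release you downloaded).
    java -jar lukeall-3.5.0.jar

    # Or get a document count and a corruption report without a GUI,
    # using the lucene-core jar that matches your index version.
    java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex crawl/index

    # Count how many URLs the fetcher actually fetched, from the crawl log.
    grep -c "fetching" logs/hadoop.log

Comparing the fetched-URL count with the document count in the index should tell you whether pages are being dropped at indexing time or never fetched at all.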
On Thu, Oct 11, 2012 at 9:57 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> After every crawl iteration check out your webdb with the readdb tool.
> There is plenty linked to from the wiki on this topic. Check
> urlfilters as an important area as well.
>
> hth
>
> Lewis
>
> On Fri, Oct 5, 2012 at 6:08 PM, Hailong Yang <[email protected]> wrote:
> > Dear all,
> >
> > I am trying to crawl a large index (maybe more than 2GB) for future
> > analysis. However, after 11 hours of crawling, I looked at the crawl
> > directory, which was 1.2GB as a whole, but the size of the index was
> > only 50MB. Could someone tell me how to configure the crawling so that
> > I can retrieve a large enough index. Thank you!
> >
> > Best
> >
> > Hailong
>
> --
> Lewis

--
Don't Grow Old, Grow Up... :-)
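For reference, the readdb check Lewis mentions looks roughly like this on a Nutch 1.x install (the crawl/ path is just an example, so point it at your own crawldb):

    # Summary counts: how many URLs are unfetched, fetched, gone, redirected, etc.
    bin/nutch readdb crawl/crawldb -stats

    # Dump the crawldb as text to inspect the status of individual URLs.
    bin/nutch readdb crawl/crawldb -dump crawldb-dump

If -stats shows far more fetched URLs than you see documents in the index, the urlfilters or indexing filters are the first place to look.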

