Hi All, I am a fairly new Nutch user and this is my first post here. I am trying to use Nutch to build a search engine. I have a pretty solid (I think) bash script that runs Nutch in deploy mode on a small Hadoop cluster, with the generate, fetch, parse, update, and index steps run separately. Right now I am running with what I guess you'd call a "depth" of 1: I run generate-fetch-parse-update only once per injected set of URLs.
I currently have a URL filter set so that I only index base domains, without subfolders. Using dmoz as the initial seed list, all seems to go well, and I have about 1.5 million records in Solr. My issue is that the crawl has finished! Of the approximately 3 million base URLs in dmoz, I only have 1.5 million in the index. Assuming most of those URLs are unique, and assuming the crawl also found some additional URLs to add to the index (as should be the case), I should end up with far more. On top of that, I notice that a lot of the URLs from dmoz were not added to the index. In fact, none of the ones I checked (about 10) were there.

My questions are: Is this odd behavior for Nutch? Should my seed URLs be added to my Solr index? If this does sound odd, I'm at a bit of a loss as to what I can do differently, besides perhaps using a better or more expansive seed list. Is there a way to explicitly ensure that all URLs in the base seed list are added to the Nutch index?

Thanks in advance for the help!
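
A note on the spot check of about 10 URLs mentioned above: it could be scaled up to the whole seed list with something like the rough Python sketch below. This is only an illustration under stated assumptions, not part of Nutch or of the bash script. The helper names `host_key` and `missing_seeds` are made up for this example, and it assumes that both the seed list and an export of the indexed URLs are available as plain lists of URLs. It also assumes that comparing by lowercase host name (ignoring scheme, port, path, and a leading "www.") is close enough to how the crawl normalizes URLs, which may not hold.

```python
from urllib.parse import urlsplit


def host_key(url: str) -> str:
    """Reduce a URL to a lowercase host name for comparison.

    Assumption: scheme, port, path, and a leading "www." are ignored,
    which may or may not match the crawl's own URL normalization.
    """
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # give urlsplit a network location to parse
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def missing_seeds(seed_urls, indexed_urls):
    """Return the sorted seed hosts with no counterpart in indexed_urls."""
    indexed = {host_key(u) for u in indexed_urls if u.strip()}
    seeds = {host_key(u) for u in seed_urls if u.strip()}
    seeds.discard("")  # drop entries that could not be parsed to a host
    return sorted(seeds - indexed)


if __name__ == "__main__":
    # Made-up sample data standing in for the seed list and the index export.
    seeds = ["http://a.com/", "http://www.b.com", "c.org/x"]
    indexed = ["https://a.com/index.html", "http://c.org"]
    print(missing_seeds(seeds, indexed))
```

Both arguments can be any iterable of strings, so open file handles with one URL per line would work as well as lists. The resulting list of missing hosts would at least show whether the gap is spread evenly across the seeds or concentrated in a particular kind of URL.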

