Hi All,

I am a fairly new Nutch user and this is my first post here. I have been
attempting to use Nutch for a particular application I am building, a
search engine. I have a pretty solid (I think) bash script that runs Nutch
in deploy mode on a small Hadoop cluster, separating the
generate-fetch-parse-update-index steps. Right now I'm just running with
what I guess you'd call a "depth" of 1 in that I only run
generate-fetch-parse-update once per injected set of URLs.

I currently have a URL filter set so that I only end up indexing base
domains without subfolders. Using dmoz as the initial seed list, all seems
to go well, and I have about 1.5 million records in Solr. My issue is that
the crawl has finished! dmoz has approximately 3 million base URLs, but I
only have 1.5 million records. Assuming most of those URLs are unique, and
assuming the crawl discovered some additional URLs to add to the index, as
should be the case, I would expect to end up with far more. I also notice
that a lot of the URLs from dmoz were not added to the index. In fact, none
of the ones I checked (about 10) were.
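
For illustration, the "base domains without subfolders" rule described
above can be approximated in Python as follows. This is only a sketch of
the intent, not the actual Nutch URL filter configuration; the function
name is_base_url and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def is_base_url(url: str) -> bool:
    """Return True when the URL is a site root: no subfolder, query, or fragment."""
    parts = urlsplit(url)
    # Only consider http(s) URLs that actually carry a host name.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # A base URL has an empty path or just "/", and nothing after it.
    return parts.path in ("", "/") and not parts.query and not parts.fragment


# Hypothetical seed entries, used only to exercise the rule.
seeds = [
    "http://example.com/",
    "http://example.com/about/",
    "https://example.org",
    "http://example.net/index.html?x=1",
]
kept = [u for u in seeds if is_base_url(u)]
print(kept)
```

Under a rule like this, any URL that carries a path, query, or fragment is
rejected, so only the first and third example entries are kept.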

My question is, is this odd behavior for Nutch? Should my seed URLs be
added to my Solr index? If this does sound odd, I'm at a bit of a loss as
to what I can do differently, other than perhaps using a better or more
expansive seed list. Is there a way to explicitly ensure that all URLs in
the base seed list are added to the Nutch index?

Thanks in advance for the help!
