Seed List URLs To Index Question

AC Nutch Thu, 26 Jul 2012 21:18:44 -0700

Hi All,

I am a fairly new Nutch user and this is my first post here. I have been
attempting to use Nutch for a particular application I am building, a
search engine. I have a pretty solid (I think) bash script that runs Nutch
in deploy mode on a small Hadoop cluster, separating the
generate-fetch-parse-update-index steps. Right now I'm just running with
what I guess you'd call a "depth" of 1 in that I only run
generate-fetch-parse-update once per injected set of URLs.
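For anyone following along, one depth-1 round as described above could be sketched roughly like this. This is only an illustration, not my actual bash script: the function name, paths, and Solr URL are made up, and the subcommand names assume Nutch 1.x. The exact solrindex arguments differ between 1.x releases, so treat that last line as an assumption. The sketch only builds the command lists and does not run anything.

```python
from typing import List


def one_round_commands(
    crawldb: str,
    segments_dir: str,
    segment: str,
    seed_dir: str,
    solr_url: str,
) -> List[List[str]]:
    """Build the commands for one inject + generate-fetch-parse-update-index round.

    All arguments are hypothetical placeholders. In a real script the segment
    path is only known after `generate` has created it.
    """
    return [
        ["bin/nutch", "inject", crawldb, seed_dir],          # add seed URLs to the crawldb
        ["bin/nutch", "generate", crawldb, segments_dir],    # pick URLs due for fetching
        ["bin/nutch", "fetch", segment],                     # download the selected pages
        ["bin/nutch", "parse", segment],                     # extract text and outlinks
        ["bin/nutch", "updatedb", crawldb, segment],         # merge results into the crawldb
        # Assumed argument order; check the usage output of your Nutch version.
        ["bin/nutch", "solrindex", solr_url, crawldb, segment],
    ]


if __name__ == "__main__":
    for cmd in one_round_commands(
        "crawl/crawldb",
        "crawl/segments",
        "crawl/segments/20120726211844",
        "urls",
        "http://localhost:8983/solr/",
    ):
        print(" ".join(cmd))
```

Running this round once per injected seed set is what I mean by a "depth" of 1.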


I currently have a URL filter set so that I only end up indexing base
domains without subfolders. Using dmoz as the initial seed list, all seems
to go well, and I have about 1.5 million records in Solr. My issue is that
the crawl is finished! Of the approximately 3 million base URLs in dmoz, I
only have 1.5 million. Assuming most of these URLs are unique, and assuming
I actually found some additional URLs to add to the index, as should be the
case, I should end up with far more. I also notice that a lot of the URLs
from dmoz were not added to the index. In fact, none of the ones I checked
(about 10) were there.
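To make the check less anecdotal than about 10 URLs, something like the following Python sketch could compare the seed list against the URLs exported from Solr. It is an illustration under assumptions: the function names are hypothetical, "base domain without subfolders" is taken to mean an empty or "/" path with no query string, and the normalization here is much simpler than whatever Nutch applies internally.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def is_base_url(url: str) -> bool:
    """True when the URL has a host, no subfolder path, and no query string."""
    parts = urlsplit(url.strip())
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query


def normalize(url: str) -> str:
    """Lowercase scheme and host and force a trailing slash (assumed convention)."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}/"


def missing_seeds(seed_urls: Iterable[str], indexed_urls: Iterable[str]) -> List[str]:
    """Return base seed URLs that do not appear among the indexed URLs."""
    indexed = {normalize(u) for u in indexed_urls}
    seeds = {normalize(u) for u in seed_urls if is_base_url(u)}
    return sorted(seeds - indexed)


if __name__ == "__main__":
    seeds = [
        "http://Example.com",
        "http://example.org/",
        "http://example.net/sub/page.html",  # not a base URL, so it is skipped
    ]
    indexed = ["http://example.com/"]
    print(missing_seeds(seeds, indexed))
```

The URLs in the usage block are placeholders, not entries from my seed list or index.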

My question is: is this odd behavior for Nutch? Should my seed URLs be
added to my Solr index? If this does sound odd, I'm at a bit of a loss as
to what I can do differently, besides perhaps using a better or more
expansive seed list. Is there a way to explicitly ensure that all URLs in
the base seed list are added to the index?

Thanks in advance for the help!
