Re: Seed List URLs To Index Question

Ferdy Galema Fri, 27 Jul 2012 02:10:14 -0700

Hi,

What version of Nutch are you running? Note that there can be many reasons
why URLs do not end up in the index. The most likely one is that not
everything was crawled (which can itself have many causes). Try running
some of the statistics jobs or inspecting the job counters to see whether
the numbers match what you expect.
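
To go beyond spot-checking ~10 URLs by hand, one option is to compare the
seed list against the URLs that actually reached the index. The following is
only a minimal sketch in Python, not part of Nutch: it assumes you have
already exported the indexed URLs to a plain list yourself, and the
normalization rules (lowercase host, drop "www.", drop trailing slash) are an
assumption that may not match the URL normalizers configured in your crawl.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparable key.

    Assumed rules (hypothetical, adjust to your own setup):
    lowercase host, strip a leading 'www.', strip trailing slashes.
    """
    url = url.strip()
    if "://" not in url:
        # Seed lists sometimes omit the scheme; assume http for parsing.
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def missing_seeds(seed_urls, indexed_urls):
    """Return the seed URLs whose normalized form is not in the index."""
    indexed = {normalize(u) for u in indexed_urls if u.strip()}
    return [
        u.strip()
        for u in seed_urls
        if u.strip() and normalize(u) not in indexed
    ]


if __name__ == "__main__":
    # Tiny illustrative inputs; in practice read both lists from files.
    seeds = ["http://www.example.com/", "http://example.org"]
    indexed = ["http://example.com"]
    for url in missing_seeds(seeds, indexed):
        print(url)
```

If most seeds turn up as missing, the job counters for the fetch and parse
steps are the next place to look, since the URLs may never have been fetched
or may have been dropped by the url filter.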


Ferdy.


On Fri, Jul 27, 2012 at 6:18 AM, AC Nutch <[email protected]> wrote:

> Hi All,
>
> I am a fairly new Nutch user and this is my first post here. I have been
> attempting to use Nutch for a particular application I am building, a
> search engine. I have a pretty solid (I think) bash script that runs Nutch
> in deploy mode on a small Hadoop cluster, separating the
> generate-fetch-parse-update-index steps. Right now I'm just running with
> what I guess you'd call a "depth" of 1 in that I only run
> generate-fetch-parse-update once per injected set of URLs.
>
> I currently have a url filter set so that I only end up indexing base
> domains without subfolders. Using dmoz as the initial seed list, all seems
> to go well, and I have about 1.5 million records in Solr. My issue is that
> it's finished! Of the approximately 3 million base URLs in dmoz I only have
> 1.5 million. From my perspective, assuming most of these URLs are unique
> and assuming I actually found some additional URLs to add to the index, as
> should be the case, I should end up with way more. I also notice that a lot
> of the URLs from dmoz were not added to the index. In fact, none of the
> ones I checked (I checked ~10) were.
>
> My question is, is this odd behavior for Nutch? Should my seed URLs be
> added to my Solr index? If this does sound odd, I'm at a bit of a loss as
> to what I can do differently, besides perhaps use a better or more
> expansive seed list or something. Is there a way to explicitly ensure that
> all URLs in the base seed list are added to the Nutch index?
>
> Thanks in advance for the help!
>
