Hi Andrew,

> if this flag is used *--sitemaps-from-hostdb always*

Do the crawled hosts announce the sitemap in their robots.txt?
If not, do the sitemap URLs follow the pattern
  http://example.com/sitemap.xml ?
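
For reference, a sitemap is announced in robots.txt with a single line
such as the following (example.com is just a placeholder):

  Sitemap: http://example.com/sitemap.xml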

See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature

If this is not the case, it's required to put the URLs pointing
to the sitemaps into a separate list and call bin/crawl with the
option `-sm <sitemap_dir>`.
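
To illustrate, that could look roughly like this (directory and file
names are placeholders, adapt them to your setup):

  # sitemap URLs, one per line, in their own directory
  mkdir -p $NUTCH_HOME/sitemap_urls
  echo "http://example.com/sitemap.xml" > $NUTCH_HOME/sitemap_urls/seeds.txt

  # pass that directory to bin/crawl via -sm
  $NUTCH_HOME/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
    -s $NUTCH_HOME/urls/ -sm $NUTCH_HOME/sitemap_urls $NUTCH_HOME/Crawl 10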

> nutch-default.xml set the interval to 2 seconds from default 30 days.

Ok, for one day or even a few hours. But why "2 seconds"?
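
For example, one day would look like this (86400 seconds; overrides are
usually placed in nutch-site.xml rather than nutch-default.xml):

  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <description>Refetch pages after one day (value in seconds).</description>
  </property>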


> I also don't understand why the crawldb is automatically deleted

The crawldb isn't removed but updated after each cycle by
- moving the previous version from "current/" to "old/"
- placing the updated version in "current/"
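
You can verify that it is still there, e.g. by dumping the crawldb
statistics (path taken from your crawl command):

  $NUTCH_HOME/bin/nutch readdb $NUTCH_HOME/Crawl/crawldb -stats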


If in doubt, and because bugs are always possible, could you share the logs
from the SitemapProcessor?


Best,
Sebastian

On 3/13/21 6:33 PM, Andrew MacKay wrote:
Hi

Hoping for some help to get sitemap.xml handling working.
I am using this command to crawl (Nutch 1.18):

NUTCH_HOME/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch
--sitemaps-from-hostdb always -s $NUTCH_HOME/urls/ $NUTCH_HOME/Crawl 10

if this flag is used *--sitemaps-from-hostdb always*
this error occurs:

Generator: number of items rejected during selection:
Generator:    201  SCHEDULE_REJECTED
Generator: 0 records selected for fetching, exiting ...

Without this flag present it crawls the site without issue.

In nutch-default.xml the interval is set to 2 seconds instead of the
default 30 days:

  <property>
    <name>db.fetch.interval.default</name>
    <value>2</value>
  </property>

I also don't understand why the crawldb is automatically deleted after each
crawl, so I cannot run any commands to check the URLs that were not crawled.

Any help is appreciated.

