Hi Andrew,

> if this flag is used *--sitemaps-from-hostdb always*
Do the crawled hosts announce the sitemap in their robots.txt? If not, do
the sitemap URLs follow the pattern http://example.com/sitemap.xml ?
See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature

If this is not the case, you need to put the URLs pointing to the sitemaps
into a separate list and call bin/crawl with the option `-sm <sitemap_dir>`
(see the sketch further below).

> In nutch-default.xml I set the interval to 2 seconds from the default 30 days.

Ok, for one day or even a few hours. But why "2 seconds"? (There is a
nutch-site.xml example below.)

> I also don't understand why the crawldb is automatically deleted

The crawldb isn't removed but updated after each cycle by
- moving the previous version from "current/" to "old/"
- placing the updated version in "current/"

If in doubt, and because bugs are always possible, could you share the logs
from the SitemapProcessor?
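For reference, a sitemap announced in robots.txt looks like this
(example.com is just a placeholder):

  # http://example.com/robots.txt
  Sitemap: http://example.com/sitemap.xml

Otherwise, roughly along these lines, assuming a local directory
"sitemap_urls/" holding a plain-text file with one sitemap URL per line
(check the wiki page above for the exact expectations):

  mkdir -p sitemap_urls
  echo "http://example.com/sitemap.xml" > sitemap_urls/sitemaps.txt

  bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
      -sm sitemap_urls/ -s urls/ Crawl 10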
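About the interval: as a sketch, the usual place for such overrides is
conf/nutch-site.xml (nutch-default.xml is better left untouched), e.g.
one day instead of 2 seconds:

  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value> <!-- one day, in seconds -->
  </property>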
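To verify that the crawldb survives a cycle, you can check the directory
layout and dump the stats (paths taken from the crawl dir in your command):

  ls $NUTCH_HOME/Crawl/crawldb/
  # expected: current/ (and old/ after the first update)

  bin/nutch readdb $NUTCH_HOME/Crawl/crawldb -stats
  # prints status counts (db_unfetched, db_fetched, db_gone, ...)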
Best,
Sebastian

On 3/13/21 6:33 PM, Andrew MacKay wrote:
> Hi, hoping for some help to get sitemaps.xml working. I am using this
> command to crawl (Nutch 1.18):
>
>   NUTCH_HOME/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
>       --sitemaps-from-hostdb always -s $NUTCH_HOME/urls/ $NUTCH_HOME/Crawl 10
>
> If this flag is used *--sitemaps-from-hostdb always*, *this error occurs*:
>
>   Generator: number of items rejected during selection:
>   Generator: 201 SCHEDULE_REJECTED
>   Generator: 0 records selected for fetching, exiting ...
>
> Without this flag present it crawls the site without issue. In
> nutch-default.xml I set the interval to 2 seconds from the default 30 days:
>
>   <name>db.fetch.interval.default</name>
>   <value>2</value>
>
> I also don't understand why the crawldb is automatically deleted after each
> crawl, so I cannot run any commands about URLs that are not crawled. Any
> help appreciated.