Hi,

While working for a client we came across a use case that seems like it
might not be uncommon.  We may have some code to contribute.

The use case is that we have a few seed URLs that we need to fetch at
relatively high frequency (e.g. every N minutes).  These URLs have pointers
to news-type content, so the seed URLs are used primarily for URL
discovery.  From there we do a relatively shallow crawl.  The important
thing is that we refetch the seed URLs (depth=0) at some high frequency,
while all other URLs can be refetched at their default frequency.  In the
case of news, that probably means "fetch once and never again".

So I'm wondering if a simple custom "seed URL scheduler" would be of
interest.  Something like:

if (URL is seed)
  fetch at seed URL fetch freq
else
  fetch at standard freq

?
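
To make the idea a bit more concrete, here is a rough, self-contained
sketch of that rule in Python.  It is only an illustration: the record
fields, function names, and interval values are all hypothetical and are
not taken from any existing scheduler API.  It assumes each URL record
carries its crawl depth (0 for a seed) and the time it was last fetched.

```python
from dataclasses import dataclass

# Hypothetical intervals, chosen only for illustration.
SEED_INTERVAL_SECS = 5 * 60                # refetch seeds every 5 minutes
DEFAULT_INTERVAL_SECS = 30 * 24 * 60 * 60  # standard refetch interval (30 days)


@dataclass
class UrlRecord:
    """Hypothetical per-URL crawl state."""
    url: str
    depth: int         # 0 means the URL is a seed
    last_fetch: float  # Unix timestamp of the last fetch


def fetch_interval(record: UrlRecord) -> int:
    """Return the refetch interval: short for seeds, default otherwise."""
    if record.depth == 0:
        return SEED_INTERVAL_SECS
    return DEFAULT_INTERVAL_SECS


def next_fetch_time(record: UrlRecord) -> float:
    """Return the earliest time at which the URL should be fetched again."""
    return record.last_fetch + fetch_interval(record)


def is_due(record: UrlRecord, now: float) -> bool:
    """Return True if the URL is due for a refetch at time `now`."""
    return now >= next_fetch_time(record)
```

With this, a seed last fetched at time t becomes due again at
t + SEED_INTERVAL_SECS, while a discovered article fetched at the same
time is not due until the default interval has passed.  "Fetch once and
never again" could be approximated by making the default interval very
large.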

... or if this can already be done without a custom scheduler, I'd love to
know how!

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
