I do not have much experience with refetching, but this looks related to a
markers bug in the DbUpdaterReducer. If you look at that class (in HEAD),
there is a line at 183 that should remove the generator mark.
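To make the issue concrete, here is a minimal, self-contained sketch of the control flow involved. Note that `MarkSketch`, its `WebPage` class, and the string marker map are simplified stand-ins invented for illustration, not the real `org.apache.nutch.crawl.Mark` / `WebPage` API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for Nutch's Mark/WebPage classes, used only to
// illustrate the control flow; not the real org.apache.nutch API.
public class MarkSketch {
    static final String GENERATE_MARK = "_gnmrk_";

    // Simplified WebPage: only its markers map matters for this sketch.
    static class WebPage {
        Map<String, String> markers = new HashMap<>();
    }

    // Mirrors the GeneratorMapper check quoted below: a page that still
    // carries a generate mark is skipped and never regenerated.
    static boolean shouldSkip(WebPage page) {
        return page.markers.get(GENERATE_MARK) != null;
    }

    // The fix discussed above: the updater step should clear the generate
    // mark once the fetch/update cycle finishes, so the page becomes
    // eligible again on the next crawl.
    static void clearGenerateMark(WebPage page) {
        page.markers.remove(GENERATE_MARK);
    }

    public static void main(String[] args) {
        WebPage page = new WebPage();
        page.markers.put(GENERATE_MARK, "1345638110-1053938230");
        System.out.println(shouldSkip(page)); // true: page would be skipped
        clearGenerateMark(page);
        System.out.println(shouldSkip(page)); // false: eligible for refetch
    }
}
```

If the mark is never removed during the database update, `shouldSkip` keeps returning true on every subsequent crawl, which matches the behavior described in the question below.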

On Wed, Aug 22, 2012 at 4:19 PM, hugo.ma <[email protected]> wrote:

> Hi.
>
> This is a conceptual question; I haven't tried it yet.
> The Nutch version I am using is 2.0.
>
> Suppose that a full crawl has been made and the Nutch HSQL database is
> filled with all the data.
> My understanding is that if I run Nutch again, it should refetch all URLs
> according to the fetchSchedule rules.
> The process responsible for marking the URLs for fetching is the Generator
> mapper, but looking at the code (org.apache.nutch.crawl.GeneratorMapper)
> I see this:
> if (Mark.GENERATE_MARK.checkMark(page) != null) {
>   if (GeneratorJob.LOG.isDebugEnabled()) {
>     GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>   }
>   return;
> }
>
> The GENERATE_MARK is always != null, because after the first crawl the
> 'MARKERS' field of the database contains:
>   __prsmrk__*1345638110-1053938230 _gnmrk_*1345638110-1053938230
> _ftcmrk_*1345638110-1053938230
> The generate mark is always present.
>
> So, is my assumption correct, or am I missing something?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Question-about-recrawl-tp4002651.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
