I do not have much experience with refetching, but this looks related to a markers bug in the DbUpdateReducer. If you look at that class (in HEAD), there is a line at 183 that should remove the generate mark.
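To make the lifecycle concrete, here is a minimal, self-contained sketch of the behavior being discussed. It does not use the real Nutch classes (`WebPage`, `Mark`, etc. are Gora/Avro-backed); a plain `Map` stands in for the page's markers field, and the method names are hypothetical. The point is only to show why a generate mark that is never removed makes the GeneratorMapper skip the URL forever, and how removing it in the db-update step restores eligibility:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Nutch 2.x WebPage's markers field.
public class GenerateMarkDemo {
    static final String GENERATE_MARK = "_gnmrk_";

    // Mirrors the skip check quoted from GeneratorMapper below:
    // a page whose markers still contain the generate mark is skipped.
    static boolean shouldSkip(Map<String, String> markers) {
        return markers.get(GENERATE_MARK) != null;
    }

    // Mirrors the fix described above: the db-update step should
    // remove the generate mark so the page can be generated again.
    static void dbUpdate(Map<String, String> markers) {
        markers.remove(GENERATE_MARK);
    }

    public static void main(String[] args) {
        Map<String, String> markers = new HashMap<>();
        // State after a first crawl: generate mark is set.
        markers.put(GENERATE_MARK, "1345638110-1053938230");

        System.out.println(shouldSkip(markers)); // true: URL would never be refetched
        dbUpdate(markers);                       // the step the buggy build skipped
        System.out.println(shouldSkip(markers)); // false: eligible for re-generation
    }
}
```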
On Wed, Aug 22, 2012 at 4:19 PM, hugo.ma <[email protected]> wrote:
> Hi.
>
> This is a conceptual question; I haven't tried it yet.
> The Nutch version I am using is 2.0.
>
> Suppose that a full crawl has been made and the Nutch HSQL database is
> filled with all the data.
> My understanding is that if I run Nutch again, it should refetch all URLs
> according to the fetchSchedule rules.
> The process responsible for marking the URLs for fetching is the Generator
> mapper, but looking at the code (org.apache.nutch.crawl.GeneratorMapper)
> I see this:
>
>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>       if (GeneratorJob.LOG.isDebugEnabled()) {
>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>       }
>       return;
>     }
>
> GENERATE_MARK is always != null, because after the first crawl the
> 'MARKERS' field of the database contains:
> __prsmrk__*1345638110-1053938230 _gnmrk_*1345638110-1053938230
> _ftcmrk_*1345638110-1053938230
> The generate mark is always present, so every URL is skipped.
>
> So, is my assumption correct, or am I missing something?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Question-about-recrawl-tp4002651.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

