I'm a little confused -- should I set up a whole other instance of nutch, crawldb, etc?
I set the log level to trace; I think this helps to show why:

2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...

Before I ran this I cleared the crawldb, linkdb & segments, but the URL was still rejected because the current time is before its next fetch time. Why do I get that? How do I set it up to always crawl all the docs? (Not practical for production, but it's what I want when testing...)

-- Chris

On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma <[email protected]> wrote:
>
>> > Hmm, the status db_gone prevents it from being indexed, of course. It is
>> > perfectly possible for the checkers to pass but that the fetcher will
>> > fail. There may have been an error and I remember you using a proxy
>> > earlier, that's likely the problem here too. The checkers don't use
>> > proxy configurations.
>> >
>> > Check the logs to make sure.
>>
>> I cut out the proxy, and that let me get as far as I have now. Having
>> that in place prevents me from crawling the local source...is there
>> any way to be able to crawl both the inside & outside networks?
>> [separate issue, but something that I'll need this to do]
>
> Not that I know of. You can use separate configs but this is tricky. Better
> use separate crawldb's configs etc.
>
>>
>> > That's good. But remember, to pass it _must_ match regex prefixed by a +.
>> > This, however, is not your problem because in that case it wouldn't have
>> > ended up in the CrawlDB at all.
>>
>> I have two +'s that it should match on, including +.*
>
> That'll do.
>
>>
>> > Check the fetcher output thoroughly. Grep around. You should find it.
>>
>> What exactly am I grepping for?
>> This is the block between the doc and the next one that it tries to
>> crawl....
>
> Hmm, that looks fine but can still indicate a 404 because a 404 is not an
> error. Does debug say anything? You can set the level for the Fetcher in
> conf/log4j.properties. You can use freegen tool to generate a segments from
> some input text for tests.
>
>>
>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
>> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>
>> Thanks!
>>
>> --Chris
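
For anyone reading the "-shouldFetch rejected" line from the generator log: fetchTime and curTime look like epoch milliseconds. Below is a minimal Python sketch for decoding them, assuming a simplified view of the check (an entry is skipped while its fetch time is still later than curTime). The helper names are hypothetical and are not part of Nutch; the real logic lives in the configured FetchSchedule implementation.

```python
from datetime import datetime, timezone

# Values copied from the "-shouldFetch rejected" DEBUG line in the generator log.
FETCH_TIME_MS = 1328213639379
CUR_TIME_MS = 1324757649378

MS_PER_DAY = 86_400_000


def to_utc(epoch_ms):
    """Convert epoch milliseconds to a UTC datetime (whole seconds only)."""
    return datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)


def days_until_due(fetch_time_ms, cur_time_ms):
    """Days left before the entry becomes due; negative means already due."""
    return (fetch_time_ms - cur_time_ms) / MS_PER_DAY


def is_due(fetch_time_ms, cur_time_ms):
    """Assumed, simplified check: due once fetch time is not after curTime."""
    return fetch_time_ms <= cur_time_ms


if __name__ == "__main__":
    print("fetchTime:", to_utc(FETCH_TIME_MS).isoformat())
    print("curTime:  ", to_utc(CUR_TIME_MS).isoformat())
    print("days until due:", round(days_until_due(FETCH_TIME_MS, CUR_TIME_MS), 2))
    print("due now?", is_due(FETCH_TIME_MS, CUR_TIME_MS))
```

With the values from the log, the stored fetch time falls roughly 40 days after curTime, so under this reading the URL is not yet due. That would suggest the generator is still reading a CrawlDB entry that holds an earlier schedule.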

