> I'm a little confused -- should I set up a whole other instance of
> nutch, crawldb, etc?
Yes, I use clean instances for quick testing. Makes things easy sometimes.

> Set the log to trace, I think this helps to tell why.....
>
> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...

Now, this is the generator indeed, but you need the fetcher logs.

> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
> still got a rejected because it is before the next fetch time...why do
> I get that? How do I set it up to always crawl all the docs? (Not
> practical for production, but it's what I want when testing...)

As I said, create segments using the freegen tool. It takes an input dir
with seed files, just as your initial inject. Or you can also inject files
and give them metadata with a very low fetch interval, so Nutch will crawl
them each time; I usually take this approach in small tests.

http://url<TAB>nutch.fetchInterval=10

The URL will be selected by the generator every time because of this low
fetch interval.

> -- Chris
>
> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
> <[email protected]> wrote:
>>>> Hmm, the status db_gone prevents it from being indexed, of course.
>>>> It is perfectly possible for the checkers to pass but the fetcher
>>>> to fail. There may have been an error, and I remember you using a
>>>> proxy earlier; that's likely the problem here too. The checkers don't
>>>> use proxy configurations.
>>>>
>>>> Check the logs to make sure.
>>>
>>> I cut out the proxy, and that let me get as far as I have now. Having
>>> that in place prevents me from crawling the local source...is there
>>> any way to be able to crawl both the inside & outside networks?
>>> [separate issue, but something that I'll need this to do]
>>
>> Not that I know of. You can use separate configs, but this is tricky.
>> Better use separate crawldbs, configs, etc.
>>
>>>> That's good. But remember, to pass it _must_ match a regex prefixed
>>>> by a +. This, however, is not your problem, because in that case it
>>>> wouldn't have ended up in the CrawlDB at all.
>>>
>>> I have two +'s that it should match on, including +.*
>>
>> That'll do.
>>
>>>> Check the fetcher output thoroughly. Grep around. You should find it.
>>>
>>> What exactly am I grepping for?
>>> This is the block between the doc and the next one that it tries to
>>> crawl....
>>
>> Hmm, that looks fine, but it can still indicate a 404, because a 404 is
>> not an error. Does debug say anything? You can set the level for the
>> Fetcher in conf/log4j.properties. You can use the freegen tool to
>> generate a segment from some input text for tests.
>>
>>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>>> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>>> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>>> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>>> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
>>> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>>> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>
>>> Thanks!
>>>
>>> --Chris
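For the suggestion above to raise the Fetcher's log level in conf/log4j.properties, a minimal fragment might look like the following. Treat the exact logger names and the `cmdstdout` appender as assumptions; check them against the stock log4j.properties shipped with your Nutch version:

```properties
# Assumption: Nutch 1.4 package names; cmdstdout is the console appender
# defined in the stock conf/log4j.properties.
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.http=DEBUG,cmdstdout
```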
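The per-URL metadata trick described above (URL, a literal TAB, then metadata) can be sketched as a small shell session. The seed directory name and the 10-second interval are illustrative assumptions, not anything from the thread:

```shell
# Write a one-line seed file whose URL carries per-URL metadata.
# nutch.fetchInterval=10 asks Nutch to re-fetch the URL every 10 seconds,
# so the generator will select it on every test run.
# NOTE: the separator between the URL and the metadata must be a TAB.
mkdir -p urls
printf 'http://url\tnutch.fetchInterval=10\n' > urls/seed.txt
cat urls/seed.txt
```

The same directory can feed either `bin/nutch inject crawl/crawldb urls` or the freegen tool Markus mentions (roughly `bin/nutch freegen urls crawl/segments` in Nutch 1.x), which builds a ready-to-fetch segment without consulting the crawldb's fetch times.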

