> Hmm, the status db_gone prevents it from being indexed, of course. It is
> perfectly possible for the checkers to pass and for the fetcher to still fail.
> There may have been an error, and I remember you using a proxy earlier; that's
> likely the problem here too. The checkers don't use proxy configurations.
>
> Check the logs to make sure.
>
I cut out the proxy, and that let me get as far as I have now. Having that in place prevents me from crawling the local source...is there any way to be able to crawl both the inside & outside networks? [separate issue, but something that I'll need this to do]

> That's good. But remember, to pass it _must_ match regex prefixed by a +.
> This, however, is not your problem because in that case it wouldn't have ended
> up in the CrawlDB at all.

I have two +'s that it should match on, including +.*

> Check the fetcher output thoroughly. Grep around. You should find it.

What exactly am I grepping for? This is the block between the doc and the next one that it tries to crawl....

2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13

Thanks!
--Chris
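
On the "what am I grepping for" question, one possible approach is to pull out every log line that either mentions the document's URL or was logged above INFO. The sketch below is only an illustration: it assumes the log lines follow the "date time LEVEL logger - message" layout seen in the excerpt above, and the function names (`log_level`, `interesting_lines`, `scan_log`) are made up for this example and are not part of Nutch.

```python
# Levels worth a look; INFO lines are kept only when they mention the URL.
PROBLEM_LEVELS = {"WARN", "ERROR", "FATAL"}


def log_level(line):
    """Return the third whitespace-separated field (the level), or None.

    Assumes the 'date time LEVEL logger - message' layout from the excerpt.
    """
    parts = line.split(None, 3)
    return parts[2] if len(parts) >= 3 else None


def interesting_lines(lines, url):
    """Keep lines that mention the URL or are logged at WARN or above."""
    return [
        line.rstrip("\n")
        for line in lines
        if url in line or log_level(line) in PROBLEM_LEVELS
    ]


def scan_log(path, url):
    """Apply interesting_lines to a log file on disk (path is caller-supplied)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return interesting_lines(handle, url)
```

Calling `scan_log` with the path to the fetcher's log file and `"http://url/Alpha.docx"` would return the matching lines in order. Stack-trace continuation lines are not captured by this filter, so any hit is best read in context in the original file.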

