> > Hmm, the status db_gone prevents it from being indexed, of course. It is
> > perfectly possible for the checkers to pass but that the fetcher will
> > fail. There may have been an error, and I remember you using a proxy
> > earlier; that's likely the problem here too. The checkers don't use
> > proxy configurations.
> >
> > Check the logs to make sure.
>
> I cut out the proxy, and that let me get as far as I have now. Having
> that in place prevents me from crawling the local source... is there
> any way to be able to crawl both the inside & outside networks?
> [separate issue, but something that I'll need this to do]
Not that I know of. You can use separate configs, but this is tricky. Better to use separate crawldbs, configs, etc.

> > That's good. But remember, to pass it _must_ match a regex prefixed by a +.
> > This, however, is not your problem, because in that case it wouldn't have
> > ended up in the CrawlDB at all.
>
> I have two +'s that it should match on, including +.*

That'll do.

> > Check the fetcher output thoroughly. Grep around. You should find it.
>
> What exactly am I grepping for?
> This is the block between the doc and the next one that it tries to
> crawl....

Hmm, that looks fine, but it can still indicate a 404, because a 404 is not an error. Does debug say anything? You can set the level for the Fetcher in conf/log4j.properties. You can also use the freegen tool to generate a segment from some input text for tests.

> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>
> Thanks!
>
> --Chris
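For the debug-level suggestion above, a minimal sketch of what the conf/log4j.properties change could look like. The logger names assume a stock Nutch 1.4 layout where the `cmdstdout` appender is already defined; check your own log4j.properties for the exact names before copying:

```properties
# conf/log4j.properties -- raise logging for the fetcher and the HTTP
# protocol plugin to DEBUG (class names assume Nutch 1.4 defaults)
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.http.Http=DEBUG,cmdstdout
```

With DEBUG on, a 404 or a proxy-related failure for the Alpha.docx fetch should show up explicitly in logs/hadoop.log. For the freegen suggestion, the usual invocation is along the lines of `bin/nutch freegen <inputDir> <segmentsDir>`, where the input directory holds plain-text files of URLs; run `bin/nutch freegen` with no arguments to confirm the exact usage in your version.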

