>
> Hmm, the status db_gone prevents it from being indexed, of course. It is
> perfectly possible for the checkers to pass but for the fetcher to fail.
> There may have been an error, and I remember you using a proxy earlier; that's
> likely the problem here too. The checkers don't use proxy configurations.
>
> Check the logs to make sure.
>

I cut out the proxy, and that let me get as far as I have now.  Having
it in place prevented me from crawling the local source... is there
any way to crawl both the inside and outside networks?
[Separate issue, but something that I'll need this to do.]
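For what it's worth, the proxy host and port are ordinary Nutch properties in conf/nutch-site.xml (the fetcher log below prints them as http.proxy.host / http.proxy.port). A minimal sketch, assuming a hypothetical proxy at proxy.example.com; whether Nutch 1.4 supports a per-host proxy bypass is not something I can confirm, so splitting internal and external hosts into two crawls with different configs may be the fallback:

```xml
<!-- conf/nutch-site.xml: route fetches through a proxy.
     proxy.example.com is a placeholder; the port matches the log below.
     If per-host bypass isn't available in your version, run two crawls
     with and without these properties set. -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
```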

>
> That's good. But remember: to pass, a URL _must_ match a regex prefixed
> with a +. That, however, is not your problem, because in that case it
> wouldn't have ended up in the CrawlDB at all.

I have two + rules that it should match on, including +.*
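One thing worth double-checking: conf/regex-urlfilter.txt is evaluated top-down and the first matching rule wins, so a - rule above your + rules can still exclude a URL even with +.* present. A sketch of the ordering (the specific patterns here are illustrative, not your actual rules):

```
# conf/regex-urlfilter.txt -- rules apply top-down, first match decides
-\.(gif|jpg|png|css|js)$   # a deny rule placed above the allows still wins
+\.docx$                   # hypothetical allow for the document type
+.*                        # catch-all allow
```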

>
> Check the fetcher output thoroughly. Grep around. You should find it.
>

What exactly am I grepping for?
This is the block between that doc and the next one it tries to crawl...

2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - fetching http://url/Alpha.docx
2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,539 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,540 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Fetcher: throughput
threshold: -1
2011-12-19 18:42:19,543 INFO  fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2011-12-19 18:42:19,545 INFO  http.Http - http.proxy.host = null
2011-12-19 18:42:19,545 INFO  http.Http - http.proxy.port = 8080
2011-12-19 18:42:19,545 INFO  http.Http - http.timeout = 10000
2011-12-19 18:42:19,545 INFO  http.Http - http.content.limit = -1
2011-12-19 18:42:19,545 INFO  http.Http - http.agent =
crawler-nutch/Nutch-1.4 (Crawler; [email protected])
2011-12-19 18:42:19,545 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 18:42:20,545 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:21,548 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:22,550 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:23,552 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:24,554 INFO  fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
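In case it helps anyone searching the archives: the Nutch fetcher generally logs a failed fetch as "fetch of <url> failed with: <exception>", so that is a reasonable pattern to grep for. A sketch using a hypothetical sample log (on a real install, point grep at logs/hadoop.log instead; the exact wording may vary by version):

```shell
# Build a hypothetical two-line sample of a fetcher log, then grep it.
# On a real crawl, replace /tmp/hadoop.log with logs/hadoop.log.
cat > /tmp/hadoop.log <<'EOF'
2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - fetching http://url/Alpha.docx
2011-12-19 18:42:25,001 INFO  fetcher.Fetcher - fetch of http://url/Alpha.docx failed with: java.net.ConnectException
EOF

# Find the outcome line for the problem document.
grep "failed with" /tmp/hadoop.log
```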

Thanks!

--Chris
