> > Hmm, the status db_gone prevents it from being indexed, of course. It is
> > perfectly possible for the checkers to pass but that the fetcher will
> > fail. There may have been an error, and I remember you using a proxy
> > earlier; that's likely the problem here too. The checkers don't use
> > proxy configurations.
> >
> > Check the logs to make sure.
>
> I cut out the proxy, and that let me get as far as I have now. Having
> that in place prevents me from crawling the local source... is there
> any way to be able to crawl both the inside & outside networks?
> [separate issue, but something that I'll need this to do]
Not that I know of. You can use separate configs, but this is tricky. Better to use separate crawldbs, configs, etc.

> > That's good. But remember, to pass it _must_ match a regex prefixed by a +.
> > This, however, is not your problem, because in that case it wouldn't have
> > ended up in the CrawlDB at all.
>
> I have two +'s that it should match on, including +.*

That'll do.

> > Check the fetcher output thoroughly. Grep around. You should find it.
>
> What exactly am I grepping for?
> This is the block between the doc and the next one that it tries to
> crawl....

Hmm, that looks fine, but it can still indicate a 404, because a 404 is not an error. Does debug say anything? You can set the level for the Fetcher in conf/log4j.properties. You can also use the freegen tool to generate a segment from some input text for tests.

> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>
> Thanks!
>
> --Chris
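For the debug-level suggestion above, a minimal sketch of what the conf/log4j.properties change could look like. The logger names assume a stock Nutch 1.4 layout where the `cmdstdout` appender is already defined; check your own log4j.properties for the exact names before copying:

```properties
# conf/log4j.properties -- raise logging for the fetcher and the HTTP
# protocol plugin to DEBUG (class names assume Nutch 1.4 defaults)
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.http.Http=DEBUG,cmdstdout
```

With DEBUG on, a 404 or a proxy-related failure for the Alpha.docx fetch should show up explicitly in logs/hadoop.log. For the freegen suggestion, the usual invocation is along the lines of `bin/nutch freegen <inputDir> <segmentsDir>`, where the input directory holds plain-text files of URLs; run `bin/nutch freegen` with no arguments to confirm the exact usage in your version.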

