> I'm a little confused -- should I set up a whole other instance of
> nutch, crawldb, etc?

Yes, I use clean instances for quick testing. It makes things easier.

> 
> I set the log to trace; I think this helps to tell why...
> 
> 2011-12-19 20:14:10,716 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,716 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,716 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
> 2011-12-19 20:14:10,843 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,843 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,843 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:11,145 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...

Now, this is indeed the generator log, but you need the fetcher logs.
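In a local 1.x runtime the fetcher writes to logs/hadoop.log by default (the path is an assumption if you have changed log4j.properties); grepping for the class name isolates its entries:

```shell
# Show only the fetcher's entries from the end of the log;
# widen the pattern if you also want protocol-layer messages.
grep 'fetcher.Fetcher' logs/hadoop.log | tail -n 50
```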

> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
> still got a rejected because it is before the next fetch time...why do
> I get that?  How do I set it up to always crawl all the docs?  (Not
> practical for production, but it's what I want when testing...)

As I said, create segments using the freegen tool. It takes an input directory 
with seed files, just like your initial inject. Alternatively, you can inject 
URLs with metadata that sets a very low fetch interval, so Nutch will crawl 
them every time; I usually take this approach in small tests.

http://url<TAB>nutch.fetchInterval=10

The URL will be selected by the generator all the time because of this low 
fetch interval.
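As a sketch, the two approaches look like this (directory names are placeholders; freegen takes an input directory of text files with one URL per line):

```shell
# Approach 1: build a segment directly with freegen, skipping the
# generator's fetch-time check altogether.
bin/nutch freegen seeds/ crawl/segments

# Approach 2: inject the seed file shown above (URL, a tab, then
# per-URL metadata) so the generator re-selects the URL on every run.
bin/nutch inject crawl/crawldb seeds/
```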

> -- Chris
> 
> 
> 
> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma <[email protected]> wrote:
> >> > Hmm, the status db_gone prevents it from being indexed, of course. It
> >> > is perfectly possible for the checkers to pass but for the fetcher to
> >> > fail. There may have been an error, and I remember you using a
> >> > proxy earlier; that's likely the problem here too. The checkers don't
> >> > use proxy configurations.
> >> > 
> >> > Check the logs to make sure.
> >> 
> >> I cut out the proxy, and that let me get as far as I have now.  Having
> >> that in place prevents me from crawling the local source...is there
> >> any way to be able to crawl both the inside & outside networks?
> >> [separate issue, but something that I'll need this to do]
> > 
> > Not that I know of. You can use separate configs, but this is tricky.
> > Better to use separate crawldbs, configs, etc.
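To expand on the separate-everything suggestion, a hypothetical sketch (assuming the bin/nutch launcher honors NUTCH_CONF_DIR; directory names are made up):

```shell
# Internal crawl: a conf dir with no proxy settings.
NUTCH_CONF_DIR=conf-internal bin/nutch crawl seeds-internal/ -dir crawl-internal -depth 2

# External crawl: a conf dir with http.proxy.host/port set.
NUTCH_CONF_DIR=conf-external bin/nutch crawl seeds-external/ -dir crawl-external -depth 2
```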
> > 
> >> > That's good. But remember, to pass it _must_ match regex prefixed by a
> >> > +. This, however, is not your problem because in that case it
> >> > wouldn't have ended up in the CrawlDB at all.
> >> 
> >> I have two +'s that it should match on, including +.*
> > 
> > That'll do.
> > 
> >> > Check the fetcher output thoroughly. Grep around. You should find it.
> >> 
> >> What exactly am I grepping for?
> >> This is the block between the doc and the next one that it tries to
> >> crawl....
> > 
> > Hmm, that looks fine but can still indicate a 404, because a 404 is not an
> > error. Does debug say anything? You can set the level for the Fetcher in
> > conf/log4j.properties. You can use the freegen tool to generate a segment
> > from some input text for tests.
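To raise the fetcher's verbosity as suggested here, a line like this in conf/log4j.properties should do it (logger name assumed from the Nutch 1.x package layout):

```properties
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
```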
> > 
> >> 2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - fetching http://url/Alpha.docx
> >> 2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,539 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,540 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Fetcher: throughput threshold: -1
> >> 2011-12-19 18:42:19,543 INFO  fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> >> 2011-12-19 18:42:19,545 INFO  http.Http - http.proxy.host = null
> >> 2011-12-19 18:42:19,545 INFO  http.Http - http.proxy.port = 8080
> >> 2011-12-19 18:42:19,545 INFO  http.Http - http.timeout = 10000
> >> 2011-12-19 18:42:19,545 INFO  http.Http - http.content.limit = -1
> >> 2011-12-19 18:42:19,545 INFO  http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
> >> 2011-12-19 18:42:19,545 INFO  http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> >> 2011-12-19 18:42:20,545 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:21,548 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:22,550 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:23,552 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:24,554 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
> >> 
> >> Thanks!
> >> 
> >> --Chris
