I'm a little confused -- should I set up a whole other instance of
Nutch, crawldb, etc.?

I set the log to TRACE; I think this helps show why:

2011-12-19 20:14:10,716 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,716 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,716 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
2011-12-19 20:14:10,843 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,843 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,843 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:11,145 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...

Now, before I ran this I cleared the crawldb, linkdb & segments, yet
the URL was still rejected because it is before the next fetch
time. Why do I get that? And how do I set things up to always crawl
all the docs? (Not practical for production, but it's what I want
when testing...)
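
One way to force a refetch of everything while testing, assuming the
stock 1.4 Generator options, is to shift the generator's notion of the
current time forward with -adddays, or to shrink
db.fetch.interval.default for the whole test crawl (the crawl/ paths
below are placeholders):

```shell
# Shift "now" 31 days forward so every URL's scheduled fetchTime is in
# the past (db.fetch.interval.default is 2592000 s = 30 days by default).
bin/nutch generate crawl/crawldb crawl/segments -adddays 31

# Alternatively, lower the refetch interval in conf/nutch-site.xml:
#   <property>
#     <name>db.fetch.interval.default</name>
#     <value>60</value>   <!-- seconds -->
#   </property>
```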
-- Chris



On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
<[email protected]> wrote:
>
>> > Hmm, the status db_gone prevents it from being indexed, of course. It is
>> > perfectly possible for the checkers to pass but that the fetcher will
>> > fail. There may have been an error, and I remember you using a proxy
>> > earlier; that's likely the problem here too. The checkers don't use
>> > proxy configurations.
>> >
>> > Check the logs to make sure.
>>
>> I cut out the proxy, and that let me get as far as I have now.  Having
>> that in place prevents me from crawling the local source...is there
>> any way to be able to crawl both the inside & outside networks?
>> [separate issue, but something that I'll need this to do]
>
> Not that I know of. You can use separate configs, but this is tricky. Better
> to use separate crawldbs, configs, etc.
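
A sketch of what separate crawldbs and configs might look like in
practice, assuming bin/nutch honors the NUTCH_CONF_DIR environment
variable (all directory names here are hypothetical):

```shell
# Internal crawl: a conf dir without the proxy settings
NUTCH_CONF_DIR=conf-internal bin/nutch crawl urls-internal \
    -dir crawl-internal -depth 3

# External crawl: a conf dir with http.proxy.host/http.proxy.port set
NUTCH_CONF_DIR=conf-external bin/nutch crawl urls-external \
    -dir crawl-external -depth 3
```

Each run then keeps its own crawldb, linkdb, and segments under its
own -dir, so the two configurations never collide.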
>
>>
>> > That's good. But remember, to pass it _must_ match regex prefixed by a +.
>> > This, however, is not your problem because in that case it wouldn't have
>> > ended up in the CrawlDB at all.
>>
>> I have two +'s that it should match on, including +.*
>
> That'll do.
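
For context, regex-urlfilter.txt rules are applied top-down and the
first match wins, so a minimal filter for a test like this might look
as follows (the host is a placeholder):

```
# Skip common binary/asset extensions
-\.(gif|jpg|png|ico|css|js)$
# Accept the target host explicitly
+^http://url/
# Catch-all: accept everything else
+.*
```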
>
>>
>> > Check the fetcher output thoroughly. Grep around. You should find it.
>>
>> What exactly am I grepping for?
>> This is the block between the doc and the next one that it tries to
>> crawl....
>
> Hmm, that looks fine, but it can still indicate a 404, because a 404 is not
> an error. Does debug say anything? You can set the level for the Fetcher in
> conf/log4j.properties. You can use the freegen tool to generate a segment
> from some input text for tests.
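
A sketch of both suggestions, assuming the stock Nutch 1.4 layout (the
urls/ and test-segments/ paths are made up):

```shell
# In conf/log4j.properties, raise Fetcher logging to DEBUG:
#   log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG

# Build a segment directly from a text file of URLs with freegen,
# bypassing the CrawlDB's fetch scheduling entirely:
mkdir -p urls
echo "http://url/Alpha.docx" > urls/seeds.txt
bin/nutch freegen urls test-segments
```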
>
>>
>> 2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - fetching http://url/Alpha.docx
>> 2011-12-19 18:42:19,538 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,539 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,540 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,541 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 18:42:19,542 INFO  fetcher.Fetcher - Fetcher: throughput threshold: -1
>> 2011-12-19 18:42:19,543 INFO  fetcher.Fetcher - Fetcher: throughput
>> threshold retries: 5
>> 2011-12-19 18:42:19,545 INFO  http.Http - http.proxy.host = null
>> 2011-12-19 18:42:19,545 INFO  http.Http - http.proxy.port = 8080
>> 2011-12-19 18:42:19,545 INFO  http.Http - http.timeout = 10000
>> 2011-12-19 18:42:19,545 INFO  http.Http - http.content.limit = -1
>> 2011-12-19 18:42:19,545 INFO  http.Http - http.agent =
>> crawler-nutch/Nutch-1.4 (Crawler; [email protected])
>> 2011-12-19 18:42:19,545 INFO  http.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 18:42:20,545 INFO  fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:21,548 INFO  fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:22,550 INFO  fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:23,552 INFO  fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:24,554 INFO  fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>>
>> Thanks!
>>
>> --Chris
