Half-way, it's clear in the log. Is your document a redirect? I've not seen such a log line before.
* haven't double-checked source code
> Not sure where fetching starts...
>
> 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2011-12-19 20:13:54,474 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment: /cdda/nutch/crawl/segments/20111219201355
> 2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at 2011-12-19 20:13:56, elapsed: 00:00:05
> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at 2011-12-19 20:13:57
> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment: /nutch/crawl/segments/20111219201355
> 2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
> 2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins: looking in: /nutch/plugins
> 2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
> <cut plugin loader stuff, can push this if you need it>
> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared Documents/Alpha.docx
> 2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
> 2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
> 2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
> 2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,043 INFO http.Http - http.agent = google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov; [email protected])
> 2011-12-19 20:13:59,043 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
> 2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at 2011-12-19 20:14:01, elapsed: 00:00:03
> 2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment: starting at 2011-12-19 20:14:02
> 2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment: segment: /cdda/nutch/crawl/segments/20111219201355
> 2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2
> ...is that enough for the fetch logs? It's all crawl/generator messages after that.
>
>
> I ran:
> ./nutch freegen ../urls/ ./test-segments
> ./nutch readseg -dump ./test-segments/ ./segment-output
>
> I got an error:
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/content
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>
> So do I need to run the generator step in the middle? How is this different than just doing a crawl?
>
> Thanks!
>
> -- Chris
>
> On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma <[email protected]> wrote:
> >> I'm a little confused -- should I set up a whole other instance of nutch, crawldb, etc?
> >
> > Yes, i use clean instances for quick testing. Makes things easy sometimes.
>
> >> Set the log to trace, I think this helps to tell why.....
> >>
> >> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
> >> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> >
> > Now, this is the generator indeed, but you need the fetcher logs.
>
> >> Now, before I ran this I cleared the crawldb, linkdb & segments, but I still got a rejected because it is before the next fetch time...why do I get that? How do I set it up to always crawl all the docs? (Not practical for production, but it's what I want when testing...)
> >
> > As i said, create segments using the freegen tool. It takes an input dir with seed files, just as your initial inject. Or you can also inject files and give them metadata with a very low fetch interval so Nutch will crawl them each time; i usually take this approach in small tests.
> >
> > http://url<TAB>nutch.fetchInterval=10
> >
> > The URL will be selected by the generator all the time because of this low fetch interval.
>
> >> -- Chris
> >>
> >>
> >>
> >> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma <[email protected]> wrote:
> >> >> > Hmm, the status db_gone prevents it from being indexed, of course.
> >> >> > It is perfectly possible for the checkers to pass but for the fetcher to fail. There may have been an error, and i remember you were using a proxy earlier; that's likely the problem here too. The checkers don't use proxy configurations.
> >> >> >
> >> >> > Check the logs to make sure.
> >> >>
> >> >> I cut out the proxy, and that let me get as far as I have now. Having that in place prevents me from crawling the local source...is there any way to be able to crawl both the inside & outside networks? [separate issue, but something that I'll need this to do]
> >> >
> >> > Not that i know of. You can use separate configs but this is tricky. Better to use separate crawldbs, configs, etc.
> >> >
> >> >> > That's good. But remember, to pass it _must_ match a regex prefixed by a +. This, however, is not your problem because in that case it wouldn't have ended up in the CrawlDB at all.
> >> >>
> >> >> I have two +'s that it should match on, including +.*
> >> >
> >> > That'll do.
> >> >
> >> >> > Check the fetcher output thoroughly. Grep around. You should find it.
> >> >>
> >> >> What exactly am I grepping for?
> >> >> This is the block between the doc and the next one that it tries to crawl....
> >> >
> >> > Hmm, that looks fine but it can still indicate a 404, because a 404 is not an error. Does debug say anything? You can set the level for the Fetcher in conf/log4j.properties. You can use the freegen tool to generate a segment from some input text for tests.
> >> > > >> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching > >> >> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - > >> >> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO > >> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,540 > >> >> INFO > >> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,541 > >> >> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19 > >> >> 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost > >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : > >> >> byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue > >> >> mode > >> >> > >> >> : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue > >> >> > >> >> mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - > >> >> Fetcher: throughput threshold: -1 > >> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput > >> >> threshold retries: 5 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = > >> >> crawler-nutch/Nutch-1.4 (Crawler; [email protected]) > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = > >> >> en-us,en-gb,en;q=0.7,*;q=0.3 > >> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> 
>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> > >> >> Thanks! > >> >> > >> >> --Chris
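[Note on the readseg error earlier in the thread: freegen writes a timestamped segment directory *under* the output directory you give it, so readseg has to be pointed at that subdirectory, not at the parent. A minimal sketch, with placeholder paths; the -no* flags skip the parts a fresh, unfetched segment doesn't have yet:]

```shell
# Sketch, not verified against this setup. freegen creates something like
# ./test-segments/20111219201355/crawl_generate
./nutch freegen ../urls/ ./test-segments

# Pick the newest segment directory (placeholder logic):
SEG=./test-segments/$(ls ./test-segments | sort | tail -1)

# A fresh segment only contains crawl_generate; readseg -dump also looks
# for crawl_fetch, content and parse_* unless told to skip them:
./nutch readseg -dump "$SEG" ./segment-output \
  -nofetch -nocontent -noparse -noparsedata -noparsetext
```

[To get a full dump, run `./nutch fetch "$SEG"` and `./nutch parse "$SEG"` first, then dump without the -no* flags. So no separate generate step is needed: freegen replaces it.]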

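[The per-URL metadata trick Markus describes can be written as a seed file; the URL below is a placeholder, and the separator between URL and metadata must be a literal TAB character:]

```
# urls/seed.txt -- one URL per line, key=value metadata after a TAB
http://example.com/Alpha.docx	nutch.fetchInterval=10
```

[Inject it as usual, e.g. `./nutch inject crawl/crawldb urls/`; the low interval means the generator selects the URL on every run.]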

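[For the DEBUG-level fetcher logging mentioned above, a conf/log4j.properties fragment along these lines should work; this is a sketch assuming the stock Nutch log4j.properties, where the console appender is named cmdstdout -- adjust the appender name if yours differs:]

```
# Raise the fetcher's log level to DEBUG
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
```

[Restart the crawl afterwards; the DEBUG lines (e.g. -shouldFetch rejected ...) then show up in logs/hadoop.log or on stdout.]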