I don't think it's a redirect, unless SharePoint made it one. Any idea how to check for that?
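The only thing I can think of is to hit the URL by hand and look at the status line and Location header -- something like this, with the space in the path escaped (just my guess at how to check, assuming the crawler box can reach the URL directly):

curl -s -I "http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared%20Documents/Alpha.docx"

If SharePoint answers with a 301/302 and a Location header instead of a 200, then it is redirecting. Would that be enough, or does Nutch count something else as a redirect?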
-- Chris

On Mon, Dec 19, 2011 at 5:15 PM, Markus Jelsma <[email protected]> wrote:
> Half-way, it's clear in the log. Is your document a redirect? I've not yet
> seen such a log line before.
>
> * haven't double-checked the source code
>
>> Not sure where fetching starts...
>>
>> 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>> 2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>> 2011-12-19 20:13:54,474 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
>> 2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment: /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>> 2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at 2011-12-19 20:13:56, elapsed: 00:00:05
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at 2011-12-19 20:13:57
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment: /nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
>> 2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
>> 2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins: looking in: /nutch/plugins
>> 2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
>> <cut plugin loader stuff, can push this if you need it>
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared Documents/Alpha.docx
>> 2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
>> 2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
>> 2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO http.Http - http.agent = google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov; [email protected])
>> 2011-12-19 20:13:59,043 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
>> 2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at 2011-12-19 20:14:01, elapsed: 00:00:03
>> 2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment: starting at 2011-12-19 20:14:02
>> 2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment: segment: /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>
>> ...is that enough for the fetch logs? It's all crawl/generator messages after that.
>>
>> I ran:
>>
>> ./nutch freegen ../urls/ ./test-segments
>> ./nutch readseg -dump ./test-segments/ ./segment-output
>>
>> I got an error:
>>
>> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/content
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
>>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>         at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
>>         at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>>
>> So do I need to run the generator step in the middle? How is this different from just doing a crawl?
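>>
>> One guess on my end: the error is looking for crawl_generate etc. directly
>> under test-segments/, but maybe freegen wrote a timestamped segment
>> directory underneath it, the way the generator does. If so, I suppose I
>> should point readseg one level deeper and tell it to skip the parts a
>> freshly generated segment doesn't have yet -- something like this (the
>> 20111219... name being whatever freegen actually created, and assuming I
>> have the -no* flags right):
>>
>> ./nutch readseg -dump ./test-segments/20111219201355 ./segment-output -nocontent -nofetch -noparse -noparsedata -noparsetext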
>>
>> Thanks!
>>
>> -- Chris
>>
>> On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma <[email protected]> wrote:
>>>> I'm a little confused -- should I set up a whole other instance of
>>>> nutch, crawldb, etc?
>>>
>>> Yes, I use clean instances for quick testing. Makes things easy
>>> sometimes.
>>>
>>>> I set the log to trace; I think this helps to tell why.....
>>>>
>>>> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>>>> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>>>
>>> Now, this is the generator indeed, but you need the fetcher logs.
>>>
>>>> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>>>> still got a "rejected" because it is before the next fetch time... why
>>>> do I get that? How do I set it up to always crawl all the docs? (Not
>>>> practical for production, but it's what I want when testing...)
>>>
>>> As I said, create segments using the freegen tool. It takes an input dir
>>> with seed files, just as your initial inject. You can also inject URLs
>>> and give them metadata with a very low fetch interval so Nutch will
>>> crawl them each time; I usually take this approach in small tests:
>>>
>>> http://url<TAB>nutch.fetchInterval=10
>>>
>>> The URL will be selected by the generator every time because of this
>>> low fetch interval.
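>>>
>>> To be concrete, a seed file for this could look like the line below (one
>>> URL per line, a real tab before the metadata; the URL and the 10-second
>>> interval are placeholders, use whatever you're testing with):
>>>
>>> http://url/Alpha.docx<TAB>nutch.fetchInterval=10
>>>
>>> and then inject it as usual, e.g.:
>>>
>>> ./nutch inject crawl/crawldb urls/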
>>>>
>>>> -- Chris
>>>>
>>>> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma <[email protected]> wrote:
>>>>>>> Hmm, the status db_gone prevents it from being indexed, of course.
>>>>>>> It is perfectly possible for the checkers to pass but for the
>>>>>>> fetcher to fail. There may have been an error, and I remember you
>>>>>>> using a proxy earlier; that's likely the problem here too. The
>>>>>>> checkers don't use proxy configurations.
>>>>>>>
>>>>>>> Check the logs to make sure.
>>>>>>
>>>>>> I cut out the proxy, and that let me get as far as I have now.
>>>>>> Having that in place prevents me from crawling the local
>>>>>> source... is there any way to be able to crawl both the inside &
>>>>>> outside networks? [separate issue, but something that I'll need this
>>>>>> to do]
>>>>>
>>>>> Not that I know of. You can use separate configs, but this is tricky.
>>>>> Better to use separate crawldbs, configs, etc.
>>>>>
>>>>>>> That's good. But remember, to pass it _must_ match a regex prefixed
>>>>>>> by a +. This, however, is not your problem, because in that case it
>>>>>>> wouldn't have ended up in the CrawlDB at all.
>>>>>>
>>>>>> I have two +'s that it should match on, including +.*
>>>>>
>>>>> That'll do.
>>>>>
>>>>>>> Check the fetcher output thoroughly. Grep around. You should find
>>>>>>> it.
>>>>>>
>>>>>> What exactly am I grepping for?
>>>>>> This is the block between the doc and the next one that it tries to
>>>>>> crawl....
>>>>>
>>>>> Hmm, that looks fine but can still indicate a 404, because a 404 is
>>>>> not an error. Does debug say anything? You can set the level for the
>>>>> Fetcher in conf/log4j.properties. You can use the freegen tool to
>>>>> generate a segment from some input text for tests.
>>>>>
>>>>>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>>>>>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>>>>>> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>>>>>> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --Chris
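P.S. On the "set the level for the Fetcher in conf/log4j.properties" and
"grep around" suggestions earlier in the thread: here's what I'm planning to
try. I'm guessing at the logger names from the class names in the log output,
and at logs/hadoop.log being the default log file, so correct me if that's
wrong. In conf/log4j.properties:

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.http=DEBUG,cmdstdout

and then, after a fetch:

grep -B 2 -A 5 'Alpha.docx' logs/hadoop.log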

