Half-way, it's clear in the log. Is your document a redirect? I've not seen such a log line before.
* haven't double-checked source code
> Not sure where fetching starts...
>
> 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2011-12-19 20:13:54,474 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment: /cdda/nutch/crawl/segments/20111219201355
> 2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at 2011-12-19 20:13:56, elapsed: 00:00:05
> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at 2011-12-19 20:13:57
> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment: /nutch/crawl/segments/20111219201355
> 2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
> 2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins: looking in: /nutch/plugins
> 2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
> <cut plugin loader stuff, can push this if you need it>
> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared Documents/Alpha.docx
> 2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
> 2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
> 2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
> 2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-12-19 20:13:59,043 INFO http.Http - http.agent = google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov; [email protected])
> 2011-12-19 20:13:59,043 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
> 2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at 2011-12-19 20:14:01, elapsed: 00:00:03
> 2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment: starting at 2011-12-19 20:14:02
> 2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment: segment: /cdda/nutch/crawl/segments/20111219201355
> 2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2
> ...is that enough for the fetch logs? It's all crawl/generator messages after that.
>
>
> I ran:
> ./nutch freegen ../urls/ ./test-segments
> ./nutch readseg -dump ./test-segments/ ./segment-output
>
> I got an error:
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/content
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>
> So do I need to run the generator step in the middle? How is this different than just doing a crawl?
>
> Thanks!
>
> -- Chris
>
> On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma <[email protected]> wrote:
> >> I'm a little confused -- should I set up a whole other instance of nutch, crawldb, etc?
> >
> > Yes, i use clean instances for quick testing. Makes things easy sometimes.
>
> >> Set the log to trace, I think this helps to tell why.....
> >>
> >> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
> >> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> >
> > Now, this is the generator indeed, but you need the fetcher logs.
>
> >> Now, before I ran this I cleared the crawldb, linkdb & segments, but I still got a rejected because it is before the next fetch time...why do I get that? How do I set it up to always crawl all the docs? (Not practical for production, but it's what I want when testing...)
> >
> > As i said, create segments using the freegen tool. It takes an input dir with seed files, just as your initial inject. Or you can also inject files and give them metadata with a very low fetch interval so Nutch will crawl them each time; i usually take this approach in small tests.
> >
> > http://url<TAB>nutch.fetchInterval=10
> >
> > The URL will be selected by the generator all the time because of this low fetch interval.
>
> >> -- Chris
> >>
> >>
> >>
> >> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma <[email protected]> wrote:
> >> >> > Hmm, the status db_gone prevents it from being indexed, of course.
> >> >> > It is perfectly possible for the checkers to pass but for the fetcher to fail. There may have been an error, and i remember you were using a proxy earlier; that's likely the problem here too. The checkers don't use proxy configurations.
> >> >> >
> >> >> > Check the logs to make sure.
> >> >>
> >> >> I cut out the proxy, and that let me get as far as I have now. Having that in place prevents me from crawling the local source...is there any way to be able to crawl both the inside & outside networks? [separate issue, but something that I'll need this to do]
> >> >
> >> > Not that i know of. You can use separate configs but this is tricky. Better to use separate crawldbs, configs, etc.
> >> >
> >> >> > That's good. But remember, to pass it _must_ match a regex prefixed by a +. This, however, is not your problem because in that case it wouldn't have ended up in the CrawlDB at all.
> >> >>
> >> >> I have two +'s that it should match on, including +.*
> >> >
> >> > That'll do.
> >> >
> >> >> > Check the fetcher output thoroughly. Grep around. You should find it.
> >> >>
> >> >> What exactly am I grepping for?
> >> >> This is the block between the doc and the next one that it tries to crawl....
> >> >
> >> > Hmm, that looks fine but it can still indicate a 404, because a 404 is not an error. Does debug say anything? You can set the level for the Fetcher in conf/log4j.properties. You can use the freegen tool to generate a segment from some input text for tests.
> >> > > >> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching > >> >> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - > >> >> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO > >> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,540 > >> >> INFO > >> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,541 > >> >> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19 > >> >> 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost > >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : > >> >> byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue > >> >> mode > >> >> > >> >> : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue > >> >> > >> >> mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - > >> >> Fetcher: throughput threshold: -1 > >> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput > >> >> threshold retries: 5 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1 > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = > >> >> crawler-nutch/Nutch-1.4 (Crawler; [email protected]) > >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = > >> >> en-us,en-gb,en;q=0.7,*;q=0.3 > >> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> 
>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, > >> >> spinWaiting=10, fetchQueues.totalSize=13 > >> >> > >> >> Thanks! > >> >> > >> >> --Chris
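[Note on the readseg error earlier in the thread: freegen writes a timestamped segment directory *under* the output directory you give it, so readseg has to be pointed at that subdirectory, not at the parent. A minimal sketch, with placeholder paths; the -no* flags skip the parts a fresh, unfetched segment doesn't have yet:]

```shell
# Sketch, not verified against this setup. freegen creates something like
# ./test-segments/20111219201355/crawl_generate
./nutch freegen ../urls/ ./test-segments

# Pick the newest segment directory (placeholder logic):
SEG=./test-segments/$(ls ./test-segments | sort | tail -1)

# A fresh segment only contains crawl_generate; readseg -dump also looks
# for crawl_fetch, content and parse_* unless told to skip them:
./nutch readseg -dump "$SEG" ./segment-output \
  -nofetch -nocontent -noparse -noparsedata -noparsetext
```

[To get a full dump, run `./nutch fetch "$SEG"` and `./nutch parse "$SEG"` first, then dump without the -no* flags. So no separate generate step is needed: freegen replaces it.]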

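[The per-URL metadata trick Markus describes can be written as a seed file; the URL below is a placeholder, and the separator between URL and metadata must be a literal TAB character:]

```
# urls/seed.txt -- one URL per line, key=value metadata after a TAB
http://example.com/Alpha.docx	nutch.fetchInterval=10
```

[Inject it as usual, e.g. `./nutch inject crawl/crawldb urls/`; the low interval means the generator selects the URL on every run.]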

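[For the DEBUG-level fetcher logging mentioned above, a conf/log4j.properties fragment along these lines should work; this is a sketch assuming the stock Nutch log4j.properties, where the console appender is named cmdstdout -- adjust the appender name if yours differs:]

```
# Raise the fetcher's log level to DEBUG
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
```

[Restart the crawl afterwards; the DEBUG lines (e.g. -shouldFetch rejected ...) then show up in logs/hadoop.log or on stdout.]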