I don't think it's a redirect, unless SharePoint made it one. Any idea how to check for that?
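The only thing I can think of is to hit the URL by hand and look at the status line and Location header -- something like this, with the space in the path escaped (just my guess at how to check, assuming the crawler box can reach the URL directly):

curl -s -I "http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared%20Documents/Alpha.docx"

If SharePoint answers with a 301/302 and a Location header instead of a 200, then it is redirecting. Would that be enough, or does Nutch count something else as a redirect?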
-- Chris

On Mon, Dec 19, 2011 at 5:15 PM, Markus Jelsma <[email protected]> wrote:
> Half-way, it's clear in the log. Is your document a redirect? I've not yet
> seen such a log line before.
>
> * haven't double-checked the source code
>
>> Not sure where fetching starts...
>>
>> 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>> 2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>> 2011-12-19 20:13:54,474 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
>> 2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment: /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>> 2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at 2011-12-19 20:13:56, elapsed: 00:00:05
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at 2011-12-19 20:13:57
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment: /nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
>> 2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
>> 2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins: looking in: /nutch/plugins
>> 2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
>> <cut plugin loader stuff, can push this if you need it>
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared Documents/Alpha.docx
>> 2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
>> 2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
>> 2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO http.Http - http.agent = google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov; [email protected])
>> 2011-12-19 20:13:59,043 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
>> 2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at 2011-12-19 20:14:01, elapsed: 00:00:03
>> 2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment: starting at 2011-12-19 20:14:02
>> 2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment: segment: /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>
>> ...is that enough for the fetch logs? It's all crawl/generator messages after that.
>>
>> I ran:
>>
>> ./nutch freegen ../urls/ ./test-segments
>> ./nutch readseg -dump ./test-segments/ ./segment-output
>>
>> I got an error:
>>
>> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/content
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
>> Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
>>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>         at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
>>         at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>>
>> So do I need to run the generator step in the middle? How is this different from just doing a crawl?
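>>
>> One guess on my end: the error is looking for crawl_generate etc. directly
>> under test-segments/, but maybe freegen wrote a timestamped segment
>> directory underneath it, the way the generator does. If so, I suppose I
>> should point readseg one level deeper and tell it to skip the parts a
>> freshly generated segment doesn't have yet -- something like this (the
>> 20111219... name being whatever freegen actually created, and assuming I
>> have the -no* flags right):
>>
>> ./nutch readseg -dump ./test-segments/20111219201355 ./segment-output -nocontent -nofetch -noparse -noparsedata -noparsetext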
>>
>> Thanks!
>>
>> -- Chris
>>
>> On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma <[email protected]> wrote:
>>>> I'm a little confused -- should I set up a whole other instance of
>>>> nutch, crawldb, etc?
>>>
>>> Yes, I use clean instances for quick testing. Makes things easy
>>> sometimes.
>>>
>>>> I set the log to trace; I think this helps to tell why.....
>>>>
>>>> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>>>> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>>>
>>> Now, this is the generator indeed, but you need the fetcher logs.
>>>
>>>> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>>>> still got a "rejected" because it is before the next fetch time... why
>>>> do I get that? How do I set it up to always crawl all the docs? (Not
>>>> practical for production, but it's what I want when testing...)
>>>
>>> As I said, create segments using the freegen tool. It takes an input dir
>>> with seed files, just as your initial inject. You can also inject URLs
>>> and give them metadata with a very low fetch interval so Nutch will
>>> crawl them each time; I usually take this approach in small tests:
>>>
>>> http://url<TAB>nutch.fetchInterval=10
>>>
>>> The URL will be selected by the generator every time because of this
>>> low fetch interval.
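>>>
>>> To be concrete, a seed file for this could look like the line below (one
>>> URL per line, a real tab before the metadata; the URL and the 10-second
>>> interval are placeholders, use whatever you're testing with):
>>>
>>> http://url/Alpha.docx<TAB>nutch.fetchInterval=10
>>>
>>> and then inject it as usual, e.g.:
>>>
>>> ./nutch inject crawl/crawldb urls/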
>>>>
>>>> -- Chris
>>>>
>>>> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma <[email protected]> wrote:
>>>>>>> Hmm, the status db_gone prevents it from being indexed, of course.
>>>>>>> It is perfectly possible for the checkers to pass but for the
>>>>>>> fetcher to fail. There may have been an error, and I remember you
>>>>>>> using a proxy earlier; that's likely the problem here too. The
>>>>>>> checkers don't use proxy configurations.
>>>>>>>
>>>>>>> Check the logs to make sure.
>>>>>>
>>>>>> I cut out the proxy, and that let me get as far as I have now.
>>>>>> Having that in place prevents me from crawling the local
>>>>>> source... is there any way to be able to crawl both the inside &
>>>>>> outside networks? [separate issue, but something that I'll need this
>>>>>> to do]
>>>>>
>>>>> Not that I know of. You can use separate configs, but this is tricky.
>>>>> Better to use separate crawldbs, configs, etc.
>>>>>
>>>>>>> That's good. But remember, to pass it _must_ match a regex prefixed
>>>>>>> by a +. This, however, is not your problem, because in that case it
>>>>>>> wouldn't have ended up in the CrawlDB at all.
>>>>>>
>>>>>> I have two +'s that it should match on, including +.*
>>>>>
>>>>> That'll do.
>>>>>
>>>>>>> Check the fetcher output thoroughly. Grep around. You should find
>>>>>>> it.
>>>>>>
>>>>>> What exactly am I grepping for?
>>>>>> This is the block between the doc and the next one that it tries to
>>>>>> crawl....
>>>>>
>>>>> Hmm, that looks fine but can still indicate a 404, because a 404 is
>>>>> not an error. Does debug say anything? You can set the level for the
>>>>> Fetcher in conf/log4j.properties. You can use the freegen tool to
>>>>> generate a segment from some input text for tests.
>>>>>
>>>>>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>>>>>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>>>>>> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>>>>>> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
>>>>>> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>>>>>> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --Chris
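P.S. On the "set the level for the Fetcher in conf/log4j.properties" and
"grep around" suggestions earlier in the thread: here's what I'm planning to
try. I'm guessing at the logger names from the class names in the log output,
and at logs/hadoop.log being the default log file, so correct me if that's
wrong. In conf/log4j.properties:

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.http=DEBUG,cmdstdout

and then, after a fetch:

grep -B 2 -A 5 'Alpha.docx' logs/hadoop.log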

