Not sure where fetching starts...
2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find
rules for scope 'generate_host_count', using default
2011-12-19 20:13:54,474 INFO crawl.Generator - Generator:
Partitioning selected urls for politeness.
2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment:
/cdda/nutch/crawl/segments/20111219201355
2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at
2011-12-19 20:13:56, elapsed: 00:00:05
2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at
2011-12-19 20:13:57
2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment:
/nutch/crawl/segments/20111219201355
2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins:
looking in: /nutch/plugins
2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished:
total 1 records + hit by time limit :0
<cut plugin loader stuff; I can send it if you need it>
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching
http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared
Documents/Alpha.docx
2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
threshold: -1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,043 INFO http.Http - http.agent =
google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov;
[email protected])
2011-12-19 20:13:59,043 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at
2011-12-19 20:14:01, elapsed: 00:00:03
2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment:
starting at 2011-12-19 20:14:02
2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment:
segment: /cdda/nutch/crawl/segments/20111219201355
2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
...is that enough for the fetch logs? It's all crawl/generator
messages after that.
I ran:
./nutch freegen ../urls/ ./test-segments
./nutch readseg -dump ./test-segments/ ./segment-output
I got an error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/content
Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
Input path does not exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
        at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
So do I need to run the generator step in the middle? How is this
different from just doing a crawl?
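Or is it just that freegen wrote the segment into a timestamped
subdirectory under test-segments, and that the segment needs a fetch and
parse pass before all six parts exist? i.e. something like this (the
glob assumes there's only one segment in there):
./nutch freegen ../urls/ ./test-segments
./nutch fetch ./test-segments/2011*
./nutch parse ./test-segments/2011*
./nutch readseg -dump ./test-segments/2011* ./segment-output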
Thanks!
-- Chris
On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma
<[email protected]> wrote:
>
>> I'm a little confused -- should I set up a whole other instance of
>> nutch, crawldb, etc?
>
> Yes, I use clean instances for quick testing. It makes things easier sometimes.
>
>>
>> I set the log level to TRACE; I think this helps show why...
>>
>> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>
> Now, this is indeed the generator, but you need the fetcher logs.
>
>> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>> still got a rejection because it's before the next fetch time... Why do
>> I get that? How do I set it up to always crawl all the docs? (Not
>> practical for production, but it's what I want when testing...)
>
> As I said, create segments using the freegen tool. It takes an input dir with
> seed files, just like your initial inject. Or you can inject URLs and give
> them metadata with a very low fetch interval so Nutch will crawl them each
> time; I usually take this approach in small tests.
>
> http://url<TAB>nutch.fetchInterval=10
>
> The URL will be selected by the generator all the time because of this low
> fetch interval.
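>
> For example (the metadata is tab-separated from the URL; paths are
> illustrative):
>
> printf 'http://url\tnutch.fetchInterval=10\n' > seeds/urls.txt
> bin/nutch inject crawl/crawldb seeds/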
>
>> -- Chris
>>
>>
>>
>> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>>
>> <[email protected]> wrote:
>> >> > Hmm, the status db_gone prevents it from being indexed, of course. It
>> >> > is perfectly possible for the checkers to pass but that the fetcher
>> >> > will fail. There may have been an error, and I remember you using a
>> >> > proxy earlier, that's likely the problem here too. The checkers don't
>> >> > use proxy configurations.
>> >> >
>> >> > Check the logs to make sure.
>> >>
>> >> I cut out the proxy, and that let me get as far as I have now. Having
>> >> that in place prevents me from crawling the local source...is there
>> >> any way to be able to crawl both the inside & outside networks?
>> >> [separate issue, but something that I'll need this to do]
>> >
>> > Not that I know of. You can use separate configs, but this is tricky.
>> > Better to use separate crawldbs, configs, etc.
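>> >
>> > For example, with a second conf dir per network (assuming bin/nutch
>> > honors NUTCH_CONF_DIR; paths are illustrative):
>> >
>> > NUTCH_CONF_DIR=/data/nutch-inside/conf bin/nutch generate crawl-inside/crawldb crawl-inside/segments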
>> >
>> >> > That's good. But remember, to pass, it _must_ match a regex prefixed by a
>> >> > +. This, however, is not your problem, because in that case it
>> >> > wouldn't have ended up in the CrawlDB at all.
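>> >> >
>> >> > For instance, conf/regex-urlfilter.txt entries look like this (the
>> >> > host rule is just an illustration); the first matching pattern wins:
>> >> >
>> >> > -\.(gif|jpg|png)$
>> >> > +^http://url/
>> >> > +.*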
>> >>
>> >> I have two +'s that it should match on, including +.*
>> >
>> > That'll do.
>> >
>> >> > Check the fetcher output thoroughly. Grep around. You should find it.
>> >>
>> >> What exactly am I grepping for?
>> >> This is the block between the doc and the next one that it tries to
>> >> crawl....
>> >
>> > Hmm, that looks fine, but it can still indicate a 404, because a 404 is
>> > not an error. Does debug say anything? You can set the level for the
>> > Fetcher in conf/log4j.properties. You can use the freegen tool to
>> > generate a segment from some input text for tests.
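>> >
>> > For the Fetcher specifically, a line like this in conf/log4j.properties
>> > should do it (logger name taken from the class in your log output):
>> >
>> > log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG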
>> >
>> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent = crawler-nutch/Nutch-1.4 (Crawler; [email protected])
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
>> >>
>> >> Thanks!
>> >>
>> >> --Chris