Hi, I made a simple example.
Put http://mobile365.ru in seed.txt and it will produce the error. Put
http://www.mobile365.ru in seed.txt instead, and the second launch of the
crawler script will work fine and fetch the http://www.mobile365.ru/test.html
page.

On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
mathijs.hommi...@kalooga.com> wrote:

> Hi,
>
> I read from your logs:
> - test.com is injected.
> - test.com is fetched and parsed successfully.
> - but when you run a generate again (second launch), no segment is created
> (because no url is selected) and your script tries to fetch and parse the
> first segment again. Hence the errors.
>
> So test.com is fetched successfully. The question remains: why is no url
> selected in the second generate?
> Many answers are possible. Can you tell us what urls you have in your
> crawldb after the first cycle? Perhaps no outlinks have been found / added.
>
> Mathijs
>
> On Aug 7, 2012, at 16:02, Alexei Korolev <alexei.koro...@gmail.com> wrote:
>
> > Hello,
> >
> > Yes, test.com and www.test.com exist.
> > test.com does not redirect to www.test.com; it opens a page whose
> > outgoing links all have the www. prefix, like www.test.com/page1 and
> > www.test.com/page2.
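A quick way to double-check the redirect question outside of Nutch, assuming
curl is available: a 3xx status line together with a Location header would
mean test.com does redirect to the www. host.

  # Print just the HTTP status line and any redirect target.
  curl -sI http://test.com | grep -Ei '^(HTTP|Location)'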
> > First launch of the crawler script:
> >
> > root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > Injector: starting at 2012-08-07 16:00:30
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> > Generator: starting at 2012-08-07 16:00:33
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807160035
> > Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2012-08-07 16:00:37
> > Fetcher: segment: crawl/crawldb/segments/20120807160035
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > fetching http://test.com
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold: -1
> > Fetcher: throughput threshold retries: 5
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> > ParseSegment: starting at 2012-08-07 16:00:41
> > ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > Parsing: http://test.com
> > ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> > CrawlDb update: starting at 2012-08-07 16:00:44
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> > LinkDb: starting at 2012-08-07 16:00:46
> > LinkDb: linkdb: crawl/crawldb/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/data/nutch/crawl/crawldb/segments/20120807160035
> > LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
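Mathijs's question above (which urls are in the crawldb after the first
cycle?) can be answered with Nutch's readdb tool. A minimal sketch, assuming
the crawl/crawldb path from the script:

  # Per-status counts: how many urls are unfetched, fetched, gone, etc.
  bin/nutch readdb crawl/crawldb -stats

  # Dump every crawldb entry (url, status, score) as plain text, then
  # look for the www outlinks to see whether they were ever added.
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  grep 'www.test.com' crawldb-dump/part-*

If the www urls are missing entirely, the outlinks were filtered out or never
extracted; if they are present but every entry is already fetched, the
generator simply had nothing eligible to select.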
> > Second launch of the script:
> >
> > root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > Injector: starting at 2012-08-07 16:01:30
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> > Generator: starting at 2012-08-07 16:01:33
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2012-08-07 16:01:35
> > Fetcher: segment: crawl/crawldb/segments/20120807160035
> > Fetcher: java.io.IOException: Segment already fetched!
> >     at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> >     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> >     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> >
> > ParseSegment: starting at 2012-08-07 16:01:35
> > ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> >     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> > CrawlDb update: starting at 2012-08-07 16:01:36
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> > LinkDb: starting at 2012-08-07 16:01:37
> > LinkDb: linkdb: crawl/crawldb/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/data/nutch/crawl/crawldb/segments/20120807160035
> > LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> > LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> >
> > But when seed.txt has www.test.com instead of test.com, the second
> > launch of the crawler script finds the next segment for fetching.
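The "Segment already fetched!" and "Segment already parsed!" errors are only
a symptom: generate selected 0 records and created no new segment, so the
script's "ls ... | tail -1" picked up the previous segment again. A small
guard in crawl.sh avoids that confusing second failure; a sketch, assuming
the same $NUTCH_PATH and crawl/ layout as the script quoted below:

  # Remember the newest segment before generating.
  last=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
  $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
  s1=`ls -d crawl/crawldb/segments/* | tail -1`
  # If generate selected no urls, no new segment exists; stop here
  # instead of re-fetching and re-parsing the previous one.
  if [ "$s1" = "$last" ]; then
      echo "No new segment generated, nothing to fetch." >&2
      exit 0
  fi
  $NUTCH_PATH fetch $s1
  $NUTCH_PATH parse $s1
  $NUTCH_PATH updatedb crawl/crawldb $s1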
> >
> > On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> > mathijs.hommi...@kalooga.com> wrote:
> >
> >> What do you mean exactly with "it fails in the fetch phase"?
> >> Do you get an error?
> >> Does "test.com" exist?
> >> Does it perhaps redirect to "www.test.com"?
> >> ...
> >>
> >> Mathijs
> >>
> >> On Aug 4, 2012, at 17:11, Alexei Korolev <alexei.koro...@gmail.com>
> >> wrote:
> >>
> >>> yes
> >>>
> >>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> >>> lewis.mcgibb...@gmail.com> wrote:
> >>>
> >>>> http:// ?
> >>>>
> >>>> hth
> >>>>
> >>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> >>>> alexei.koro...@gmail.com> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I have a small script:
> >>>>>
> >>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>>
> >>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>>> $NUTCH_PATH fetch $s1
> >>>>> $NUTCH_PATH parse $s1
> >>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>>
> >>>>> In seed.txt I have just one site, for example "test.com". When I
> >>>>> start the script it fails in the fetch phase.
> >>>>> If I change test.com to www.test.com it works fine. The reason seems
> >>>>> to be that the outgoing links on test.com all have the www. prefix.
> >>>>> What do I need to change in the nutch config to work with test.com?
> >>>>>
> >>>>> Thank you in advance. I hope my explanation is clear :)
> >>>>>
> >>>>> --
> >>>>> Alexei A. Korolev
> >>>>
> >>>>
> >>>> --
> >>>> Lewis
> >>>
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >
> > --
> > Alexei A. Korolev

--
Alexei A. Korolev
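As for the original question (what to change in the nutch config so test.com
works): the thread never shows the config files, so this is a guess, but
there are two usual suspects. First, if db.ignore.external.links is set to
true in conf/nutch-site.xml, outlinks from test.com to www.test.com are
treated as links to a different host and silently dropped, which matches the
symptom exactly. The property looks like this (false is the Nutch default;
the snippet goes inside the <configuration> element):

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks that lead to a different host are
    discarded; test.com and www.test.com count as different hosts.
    </description>
  </property>

Second, check conf/regex-urlfilter.txt: a host-specific accept rule such as
+^http://test\.com would pass the seed but reject every http://www.test.com/
outlink. Widening it to +^http://(www\.)?test\.com (or seeding the www form,
as you found) avoids that. The http.agent.name warning in the logs is
unrelated; it only means your agent name should also be listed first in the
http.robots.agents property.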