Hello,

Yes, test.com and www.test.com both exist. test.com does not redirect to www.test.com; it serves a page whose outgoing links all carry the www. prefix, e.g. www.test.com/page1, www.test.com/page2.
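A possible workaround, assuming the urlnormalizer-regex plugin is enabled in plugin.includes: a rewrite rule in conf/regex-normalize.xml that folds the bare host into the www. form, so both spellings are treated as one host. This is an untested sketch, not taken from an actual config:

<?xml version="1.0"?>
<regex-normalize>
  <!-- Untested sketch: rewrite bare-host URLs to the www. form so that
       outlinks like www.test.com/page1 end up on the same host as the
       seed URL http://test.com -->
  <regex>
    <pattern>^http://test\.com</pattern>
    <substitution>http://www.test.com</substitution>
  </regex>
</regex-normalize>

This would matter in particular if db.ignore.external.links is set to true in nutch-site.xml, since test.com and www.test.com then count as different hosts and the www. outlinks are dropped as external.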
First launch of crawler script:

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:00:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:00:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807160035
Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:00:37
Fetcher: segment: crawl/crawldb/segments/20120807160035
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://test.com
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
ParseSegment: starting at 2012-08-07 16:00:41
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Parsing: http://test.com
ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
CrawlDb update: starting at 2012-08-07 16:00:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:00:46
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01

Second launch of script:

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:01:35
Fetcher: segment: crawl/crawldb/segments/20120807160035
Fetcher: java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread "main" java.io.IOException: Segment already parsed!
        at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
CrawlDb update: starting at 2012-08-07 16:01:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:01:37
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02

But when seed.txt contains www.test.com instead of test.com, the second launch of the crawler script does find a next segment for fetching.

On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <mathijs.hommi...@kalooga.com> wrote:

> What do you mean exactly with "it falls on fetch phase"?
> Do you get an error?
> Does "test.com" exist?
> Does it perhaps redirect to "www.test.com"?
> ...
>
> Mathijs
>
>
> On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com> wrote:
>
> > yes
> >
> > On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> >
> >> http:// ?
> >>
> >> hth
> >>
> >> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <alexei.koro...@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I have a small script:
> >>>
> >>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>
> >>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>> $NUTCH_PATH fetch $s1
> >>> $NUTCH_PATH parse $s1
> >>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>
> >>> In seed.txt I have just one site, for example "test.com". When I start
> >>> the script, it fails in the fetch phase.
> >>> If I change test.com to www.test.com it works fine. The reason seems to
> >>> be that the outgoing links on test.com all have the www. prefix.
> >>> What do I need to change in the nutch config to make it work with test.com?
> >>>
> >>> Thank you in advance. I hope my explanation is clear :)
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >> --
> >> Lewis
> >
> > --
> > Alexei A. Korolev

--
Alexei A. Korolev
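A note on the script quoted above: in the second run the Generator selects 0 records and therefore creates no new segment, so `ls -d crawl/crawldb/segments/* | tail -1` picks up the previous, already-fetched segment, which is exactly what the "Segment already fetched!" exception complains about. A minimal guard, sketched against the quoted script (the before/after comparison is an assumption, not part of the original crawl.sh; $NUTCH_PATH is assumed to be set as in the original):

#!/bin/sh
# Sketch: run one crawl cycle, but stop cleanly when generate
# produces no new segment (otherwise tail -1 re-selects the old one).
$NUTCH_PATH inject crawl/crawldb seed.txt

before=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
s1=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`

if [ "$s1" = "$before" ]; then
    # Generator printed "0 records selected for fetching, exiting ..."
    echo "No new segment generated; skipping fetch/parse/updatedb."
    exit 0
fi

$NUTCH_PATH fetch $s1
$NUTCH_PATH parse $s1
$NUTCH_PATH updatedb crawl/crawldb $s1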