-----Original message-----
> From: Alexei Korolev <alexei.koro...@gmail.com>
> Sent: Wed 08-Aug-2012 15:43
> To: user@nutch.apache.org
> Subject: Re: crawling site without www
>
> Hi, Sebastian
>
> Seems you are right. I have db.ignore.external.links set to true.
> But how do I configure nutch to process mobile365.ru and www.mobile365.ru
> as a single site?
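For reference, one way to collapse the two host names into one is a rewrite rule in conf/regex-normalize.xml. This is a sketch, assuming the urlnormalizer-regex plugin is enabled via plugin.includes and that the site should be crawled under the non-www name:

```xml
<?xml version="1.0"?>
<!-- Sketch: strip a leading "www." so that http://www.mobile365.ru/... and
     http://mobile365.ru/... normalize to the same URL. Assumes the
     urlnormalizer-regex plugin is listed in plugin.includes. -->
<regex-normalize>
  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

Note that this rule strips "www." from every crawled URL, not just this site; if other hosts are crawled too, the pattern should be restricted to the one host.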
You can use the HostURLNormalizer for this task, or just crawl either the www
host or the non-www host, not both.

> Thanks.
>
> On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
> > Hi Alexei,
> >
> > I tried a crawl with your script fragment and Nutch 1.5.1,
> > with the URL http://mobile365.ru as seed. It worked,
> > see the annotated log below.
> >
> > Which version of Nutch do you use?
> >
> > Check the property db.ignore.external.links (default is false).
> > If true, the link from mobile365.ru to www.mobile365.ru
> > is skipped.
> >
> > Look into your crawldb (bin/nutch readdb).
> >
> > Check your URL filters with
> > bin/nutch org.apache.nutch.net.URLFilterChecker
> >
> > Finally, send the nutch-site.xml and every configuration
> > file you changed.
> >
> > Good luck,
> > Sebastian
> >
> > % nutch inject crawl/crawldb seed.txt
> > Injector: starting at 2012-08-07 20:31:00
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15
> >
> > % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:31:23
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203131
> > Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
> >
> > # Note: personally, I would prefer not to place segments (also linkdb)
> > # in the crawldb/ folder.
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:32:00
> > Fetcher: segment: crawl/crawldb/segments/20120807203131
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://mobile365.ru/
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold retries: 5
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07
> >
> > % nutch parse $s1
> > ParseSegment: starting at 2012-08-07 20:32:12
> > ParseSegment: segment: crawl/crawldb/segments/20120807203131
> > Parsed (10ms): http://mobile365.ru/
> > ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07
> >
> > % nutch updatedb crawl/crawldb/ $s1
> > CrawlDb update: starting at 2012-08-07 20:32:24
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13
> >
> > # see whether the outlink is now in crawldb:
> > % nutch readdb crawl/crawldb/ -stats
> > CrawlDb statistics start: crawl/crawldb/
> > Statistics for CrawlDb: crawl/crawldb/
> > TOTAL urls: 2
> > retry 0: 2
> > min score: 1.0
> > avg score: 1.0
> > max score: 1.0
> > status 1 (db_unfetched): 1
> > status 2 (db_fetched): 1
> > CrawlDb statistics: done
> > # => yes: http://mobile365.ru/ is fetched, outlink found
> >
> > % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:32:58
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203307
> > Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:33:34
> > Fetcher: segment: crawl/crawldb/segments/20120807203307
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://www.mobile365.ru/test.html
> > # got it
> >
> >
> > On 08/07/2012 04:37 PM, Alexei Korolev wrote:
> > > Hi,
> > >
> > > I made a simple example.
> > >
> > > Put in seed.txt:
> > > http://mobile365.ru
> > >
> > > It will produce an error.
> > >
> > > Put in seed.txt:
> > > http://www.mobile365.ru
> > >
> > > and the second launch of the crawler script will work fine and fetch
> > > the http://www.mobile365.ru/test.html page.
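The behaviour Alexei describes is consistent with db.ignore.external.links=true: Nutch compares host names, so www.mobile365.ru counts as an external host for a page on mobile365.ru and the outlink is dropped before it ever reaches the crawldb. A quick way to confirm is to flip the property in nutch-site.xml. This is a sketch of the fragment; the comment is mine, not the shipped description:

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <!-- When true, outlinks pointing to other hosts are discarded. Since the
       www and non-www names are different hosts, the link from
       http://mobile365.ru to http://www.mobile365.ru/test.html is skipped. -->
</property>
```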
> > >
> > > On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga
> > > <mathijs.hommi...@kalooga.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I read from your logs:
> > >> - test.com is injected.
> > >> - test.com is fetched and parsed successfully.
> > >> - but when you run a generate again (second launch), no segment is
> > >> created (because no url is selected) and your script tries to fetch
> > >> and parse the first segment again. Hence the errors.
> > >>
> > >> So test.com is fetched successfully. The question remains: why is no
> > >> url selected in the second generate?
> > >> Many answers are possible. Can you tell us what urls you have in your
> > >> crawldb after the first cycle? Perhaps no outlinks have been found /
> > >> added.
> > >>
> > >> Mathijs
> > >>
> > >> On Aug 7, 2012, at 16:02 , Alexei Korolev <alexei.koro...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> Yes, test.com and www.test.com exist.
> > >>> test.com does not redirect to www.test.com; it opens a page with
> > >>> outgoing links with www., like www.test.com/page1 www.test.com/page2
> > >>>
> > >>> First launch of the crawler script:
> > >>>
> > >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > >>> Injector: starting at 2012-08-07 16:00:30
> > >>> Injector: crawlDb: crawl/crawldb
> > >>> Injector: urlDir: seed.txt
> > >>> Injector: Converting injected urls to crawl db entries.
> > >>> Injector: Merging injected urls into crawl db.
> > >>> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> > >>> Generator: starting at 2012-08-07 16:00:33
> > >>> Generator: Selecting best-scoring urls due for fetch.
> > >>> Generator: filtering: true
> > >>> Generator: normalizing: true
> > >>> Generator: jobtracker is 'local', generating exactly one partition.
> > >>> Generator: Partitioning selected urls for politeness.
> > >>> Generator: segment: crawl/crawldb/segments/20120807160035
> > >>> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> > >>> Fetcher: Your 'http.agent.name' value should be listed first in
> > >>> 'http.robots.agents' property.
> > >>> Fetcher: starting at 2012-08-07 16:00:37
> > >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> > >>> Using queue mode : byHost
> > >>> Fetcher: threads: 10
> > >>> Fetcher: time-out divisor: 2
> > >>> QueueFeeder finished: total 1 records + hit by time limit :0
> > >>> Using queue mode : byHost
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> fetching http://test.com
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Using queue mode : byHost
> > >>> -finishing thread FetcherThread, activeThreads=1
> > >>> Fetcher: throughput threshold: -1
> > >>> Fetcher: throughput threshold retries: 5
> > >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -finishing thread FetcherThread, activeThreads=0
> > >>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > >>> -activeThreads=0
> > >>> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> > >>> ParseSegment: starting at 2012-08-07 16:00:41
> > >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > >>> Parsing: http://test.com
> > >>> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> > >>> CrawlDb update: starting at 2012-08-07 16:00:44
> > >>> CrawlDb update: db: crawl/crawldb
> > >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > >>> CrawlDb update: additions allowed: true
> > >>> CrawlDb update: URL normalizing: false
> > >>> CrawlDb update: URL filtering: false
> > >>> CrawlDb update: 404 purging: false
> > >>> CrawlDb update: Merging segment data into db.
> > >>> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> > >>> LinkDb: starting at 2012-08-07 16:00:46
> > >>> LinkDb: linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: URL normalize: true
> > >>> LinkDb: URL filter: true
> > >>> LinkDb: adding segment:
> > >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> > >>> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> > >>>
> > >>> Second launch of the script:
> > >>>
> > >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> > >>> Injector: starting at 2012-08-07 16:01:30
> > >>> Injector: crawlDb: crawl/crawldb
> > >>> Injector: urlDir: seed.txt
> > >>> Injector: Converting injected urls to crawl db entries.
> > >>> Injector: Merging injected urls into crawl db.
> > >>> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> > >>> Generator: starting at 2012-08-07 16:01:33
> > >>> Generator: Selecting best-scoring urls due for fetch.
> > >>> Generator: filtering: true
> > >>> Generator: normalizing: true
> > >>> Generator: jobtracker is 'local', generating exactly one partition.
> > >>> Generator: 0 records selected for fetching, exiting ...
> > >>> Fetcher: Your 'http.agent.name' value should be listed first in
> > >>> 'http.robots.agents' property.
> > >>> Fetcher: starting at 2012-08-07 16:01:35
> > >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> > >>> Fetcher: java.io.IOException: Segment already fetched!
> > >>>     at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >>>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> > >>>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> > >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> > >>>
> > >>> ParseSegment: starting at 2012-08-07 16:01:35
> > >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> > >>> Exception in thread "main" java.io.IOException: Segment already parsed!
> > >>>     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> > >>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> > >>>     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> > >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >>>     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> > >>> CrawlDb update: starting at 2012-08-07 16:01:36
> > >>> CrawlDb update: db: crawl/crawldb
> > >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> > >>> CrawlDb update: additions allowed: true
> > >>> CrawlDb update: URL normalizing: false
> > >>> CrawlDb update: URL filtering: false
> > >>> CrawlDb update: 404 purging: false
> > >>> CrawlDb update: Merging segment data into db.
> > >>> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> > >>> LinkDb: starting at 2012-08-07 16:01:37
> > >>> LinkDb: linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: URL normalize: true
> > >>> LinkDb: URL filter: true
> > >>> LinkDb: adding segment:
> > >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> > >>> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> > >>> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> > >>>
> > >>> But when seed.txt has www.test.com instead of test.com, the second
> > >>> launch of the crawler script finds the next segment for fetching.
> > >>>
> > >>> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga
> > >>> <mathijs.hommi...@kalooga.com> wrote:
> > >>>
> > >>>> What do you mean exactly with "it fails in the fetch phase"?
> > >>>> Do you get an error?
> > >>>> Does "test.com" exist?
> > >>>> Does it perhaps redirect to "www.test.com"?
> > >>>> ...
> > >>>>
> > >>>> Mathijs
> > >>>>
> > >>>> On Aug 4, 2012, at 17:11 , Alexei Korolev <alexei.koro...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> yes
> > >>>>>
> > >>>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney
> > >>>>> <lewis.mcgibb...@gmail.com> wrote:
> > >>>>>
> > >>>>>> http:// ?
> > >>>>>>
> > >>>>>> hth
> > >>>>>>
> > >>>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev
> > >>>>>> <alexei.koro...@gmail.com> wrote:
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> I have a small script:
> > >>>>>>>
> > >>>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> > >>>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > >>>>>>>
> > >>>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> > >>>>>>> $NUTCH_PATH fetch $s1
> > >>>>>>> $NUTCH_PATH parse $s1
> > >>>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> > >>>>>>>
> > >>>>>>> In seed.txt I have just one site, for example "test.com". When I
> > >>>>>>> start the script, it fails in the fetch phase.
> > >>>>>>> If I change test.com to www.test.com, it works fine. It seems the
> > >>>>>>> reason is that the outgoing links on test.com all have the www.
> > >>>>>>> prefix.
> > >>>>>>> What do I need to change in the nutch config to make it work with
> > >>>>>>> test.com?
> > >>>>>>>
> > >>>>>>> Thank you in advance. I hope my explanation is clear :)
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Alexei A. Korolev
> > >>>>>>
> > >>>>>> --
> > >>>>>> Lewis
> > >>>>>
> > >>>>> --
> > >>>>> Alexei A. Korolev
> > >>>>
> > >>>
> > >>> --
> > >>> Alexei A. Korolev
> > >>
> > >
> >
>
> --
> Alexei A. Korolev
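A closing note on the script itself: the "Segment already fetched!" / "Segment already parsed!" exceptions in the second run are a symptom, not the cause. Generate selected 0 records and created no new segment, so `ls ... | tail -1` picked up the previous segment again. A defensive sketch of the cycle (paths follow the script quoted above; `newest_segment` is a helper name introduced here, not a Nutch command):

```shell
#!/bin/sh
# Newest segment = last one lexicographically; segment names are
# timestamps (e.g. 20120807160035), so this is also chronological.
newest_segment() {
  ls -d "$1"/* 2>/dev/null | tail -1
}

# Sketch of the generate/fetch cycle with a guard, assuming $NUTCH_PATH
# points at bin/nutch as in the script above:
#   before=$(newest_segment crawl/crawldb/segments)
#   $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
#   s1=$(newest_segment crawl/crawldb/segments)
#   if [ -z "$s1" ] || [ "$s1" = "$before" ]; then
#       echo "Generator created no new segment; nothing to fetch." >&2
#       exit 0
#   fi
#   $NUTCH_PATH fetch "$s1"
#   $NUTCH_PATH parse "$s1"
#   $NUTCH_PATH updatedb crawl/crawldb "$s1"
```

With the guard in place, a run where generate selects nothing exits cleanly instead of re-fetching the old segment.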