Hi Sebastian,

Seems you are right: I have db.ignore.external.links set to true. But how do I configure Nutch so that mobile365.ru and www.mobile365.ru are processed as a single site?

Thanks.
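For reference, a minimal sketch of the two changes I am considering; both are untested assumptions on my side, only the property name comes from Sebastian's mail below. Option one switches db.ignore.external.links back to false in nutch-site.xml; option two keeps it at true but folds www.mobile365.ru onto mobile365.ru with a regex URL normalizer rule (conf/regex-normalize.xml, assuming the urlnormalizer-regex plugin is enabled), so both hostnames count as the same site:

  <!-- nutch-site.xml: allow outlinks that leave the seed host -->
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>

  <!-- conf/regex-normalize.xml (assumption): rewrite the www host onto the bare host -->
  <regex>
    <pattern>^http://www\.mobile365\.ru</pattern>
    <substitution>http://mobile365.ru</substitution>
  </regex>

Candidate URLs could then be checked against the active filters and normalizers with the tools mentioned below, e.g. by piping them into bin/nutch org.apache.nutch.net.URLFilterChecker.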
On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Alexei,
>
> I tried a crawl with your script fragment and Nutch 1.5.1
> and the URL http://mobile365.ru as seed. It worked,
> see annotated log below.
>
> Which version of Nutch do you use?
>
> Check the property db.ignore.external.links (default is false).
> If true, the link from mobile365.ru to www.mobile365.ru
> is skipped.
>
> Look into your crawldb (bin/nutch readdb).
>
> Check your URL filters with
> bin/nutch org.apache.nutch.net.URLFilterChecker
>
> Finally, send the nutch-site.xml and every configuration
> file you changed.
>
> Good luck,
> Sebastian
>
> % nutch inject crawl/crawldb seed.txt
> Injector: starting at 2012-08-07 20:31:00
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seed.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15
>
> % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> Generator: starting at 2012-08-07 20:31:23
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/crawldb/segments/20120807203131
> Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
>
> # Note: personally, I would prefer not to place segments (also linkdb)
> # in the crawldb/ folder.
>
> % s1=`ls -d crawl/crawldb/segments/* | tail -1`
>
> % nutch fetch $s1
> Fetcher: starting at 2012-08-07 20:32:00
> Fetcher: segment: crawl/crawldb/segments/20120807203131
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> fetching http://mobile365.ru/
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07
>
> % nutch parse $s1
> ParseSegment: starting at 2012-08-07 20:32:12
> ParseSegment: segment: crawl/crawldb/segments/20120807203131
> Parsed (10ms):http://mobile365.ru/
> ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07
>
> % nutch updatedb crawl/crawldb/ $s1
> CrawlDb update: starting at 2012-08-07 20:32:24
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13
>
> # see whether the outlink is now in crawldb:
> % nutch readdb crawl/crawldb/ -stats
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> TOTAL urls: 2
> retry 0: 2
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> status 1 (db_unfetched): 1
> status 2 (db_fetched): 1
> CrawlDb statistics: done
> # => yes: http://mobile365.ru/ is fetched, outlink found
>
> % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> Generator: starting at 2012-08-07 20:32:58
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/crawldb/segments/20120807203307
> Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15
>
> % s1=`ls -d crawl/crawldb/segments/* | tail -1`
>
> % nutch fetch $s1
> Fetcher: starting at 2012-08-07 20:33:34
> Fetcher: segment: crawl/crawldb/segments/20120807203307
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> fetching http://www.mobile365.ru/test.html
> # got it
>
>
> On 08/07/2012 04:37 PM, Alexei Korolev wrote:
> > Hi,
> >
> > I made a simple example.
> >
> > Put in seed.txt
> > http://mobile365.ru
> >
> > It will produce an error.
> >
> > Put in seed.txt
> > http://www.mobile365.ru
> >
> > and the second launch of the crawler script will work fine and fetch
> > the http://www.mobile365.ru/test.html page.
> >
> > On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga <
> > [email protected]> wrote:
> >
> >> Hi,
> >>
> >> I read from your logs:
> >> - test.com is injected.
> >> - test.com is fetched and parsed successfully.
> >> - but when you run a generate again (second launch), no segment is created
> >> (because no url is selected) and your script tries to fetch and parse the
> >> first segment again. Hence the errors.
> >>
> >> So test.com is fetched successfully. The question remains: why is no url
> >> selected in the second generate?
> >> Many answers are possible. Can you tell us what urls you have in your
> >> crawldb after the first cycle? Perhaps no outlinks have been found / added.
> >>
> >> Mathijs
> >>
> >>
> >> On Aug 7, 2012, at 16:02 , Alexei Korolev <[email protected]>
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> Yes, test.com and www.test.com exist.
> >>> test.com does not redirect to www.test.com; it opens a page with outgoing
> >>> links that have the www. prefix, like www.test.com/page1 www.test.com/page2
> >>>
> >>> First launch of the crawler script:
> >>>
> >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> >>> Injector: starting at 2012-08-07 16:00:30
> >>> Injector: crawlDb: crawl/crawldb
> >>> Injector: urlDir: seed.txt
> >>> Injector: Converting injected urls to crawl db entries.
> >>> Injector: Merging injected urls into crawl db.
> >>> Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
> >>> Generator: starting at 2012-08-07 16:00:33
> >>> Generator: Selecting best-scoring urls due for fetch.
> >>> Generator: filtering: true
> >>> Generator: normalizing: true
> >>> Generator: jobtracker is 'local', generating exactly one partition.
> >>> Generator: Partitioning selected urls for politeness.
> >>> Generator: segment: crawl/crawldb/segments/20120807160035
> >>> Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
> >>> Fetcher: Your 'http.agent.name' value should be listed first in
> >>> 'http.robots.agents' property.
> >>> Fetcher: starting at 2012-08-07 16:00:37
> >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> >>> Using queue mode : byHost
> >>> Fetcher: threads: 10
> >>> Fetcher: time-out divisor: 2
> >>> QueueFeeder finished: total 1 records + hit by time limit :0
> >>> Using queue mode : byHost
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> fetching http://test.com
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> -finishing thread FetcherThread, activeThreads=1
> >>> Using queue mode : byHost
> >>> Fetcher: throughput threshold: -1
> >>> Fetcher: throughput threshold retries: 5
> >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>> -finishing thread FetcherThread, activeThreads=0
> >>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >>> -activeThreads=0
> >>> Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
> >>> ParseSegment: starting at 2012-08-07 16:00:41
> >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> >>> Parsing: http://test.com
> >>> ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
> >>> CrawlDb update: starting at 2012-08-07 16:00:44
> >>> CrawlDb update: db: crawl/crawldb
> >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> >>> CrawlDb update: additions allowed: true
> >>> CrawlDb update: URL normalizing: false
> >>> CrawlDb update: URL filtering: false
> >>> CrawlDb update: 404 purging: false
> >>> CrawlDb update: Merging segment data into db.
> >>> CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
> >>> LinkDb: starting at 2012-08-07 16:00:46
> >>> LinkDb: linkdb: crawl/crawldb/linkdb
> >>> LinkDb: URL normalize: true
> >>> LinkDb: URL filter: true
> >>> LinkDb: adding segment:
> >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> >>> LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01
> >>>
> >>> Second launch of the script:
> >>>
> >>> root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
> >>> Injector: starting at 2012-08-07 16:01:30
> >>> Injector: crawlDb: crawl/crawldb
> >>> Injector: urlDir: seed.txt
> >>> Injector: Converting injected urls to crawl db entries.
> >>> Injector: Merging injected urls into crawl db.
> >>> Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
> >>> Generator: starting at 2012-08-07 16:01:33
> >>> Generator: Selecting best-scoring urls due for fetch.
> >>> Generator: filtering: true
> >>> Generator: normalizing: true
> >>> Generator: jobtracker is 'local', generating exactly one partition.
> >>> Generator: 0 records selected for fetching, exiting ...
> >>> Fetcher: Your 'http.agent.name' value should be listed first in
> >>> 'http.robots.agents' property.
> >>> Fetcher: starting at 2012-08-07 16:01:35
> >>> Fetcher: segment: crawl/crawldb/segments/20120807160035
> >>> Fetcher: java.io.IOException: Segment already fetched!
> >>> at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
> >>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
> >>> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
> >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
> >>>
> >>> ParseSegment: starting at 2012-08-07 16:01:35
> >>> ParseSegment: segment: crawl/crawldb/segments/20120807160035
> >>> Exception in thread "main" java.io.IOException: Segment already parsed!
> >>> at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
> >>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> >>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>> at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>> at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> >>> CrawlDb update: starting at 2012-08-07 16:01:36
> >>> CrawlDb update: db: crawl/crawldb
> >>> CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
> >>> CrawlDb update: additions allowed: true
> >>> CrawlDb update: URL normalizing: false
> >>> CrawlDb update: URL filtering: false
> >>> CrawlDb update: 404 purging: false
> >>> CrawlDb update: Merging segment data into db.
> >>> CrawlDb update: finished at 2012-08-07 16:01:37, elapsed: 00:00:01
> >>> LinkDb: starting at 2012-08-07 16:01:37
> >>> LinkDb: linkdb: crawl/crawldb/linkdb
> >>> LinkDb: URL normalize: true
> >>> LinkDb: URL filter: true
> >>> LinkDb: adding segment:
> >>> file:/data/nutch/crawl/crawldb/segments/20120807160035
> >>> LinkDb: merging with existing linkdb: crawl/crawldb/linkdb
> >>> LinkDb: finished at 2012-08-07 16:01:40, elapsed: 00:00:02
> >>>
> >>>
> >>> But when seed.txt has www.test.com instead of test.com, the second launch of
> >>> the crawler script finds the next segment for fetching.
> >>>
> >>> On Sat, Aug 4, 2012 at 7:33 PM, Mathijs Homminga <
> >>> [email protected]> wrote:
> >>>
> >>>> What do you mean exactly with "it falls on fetch phase"?
> >>>> Do you get an error?
> >>>> Does "test.com" exist?
> >>>> Does it perhaps redirect to "www.test.com"?
> >>>> ...
> >>>>
> >>>> Mathijs
> >>>>
> >>>>
> >>>> On Aug 4, 2012, at 17:11 , Alexei Korolev <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> yes
> >>>>>
> >>>>> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> http:// ?
> >>>>>>
> >>>>>> hth
> >>>>>>
> >>>>>> On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev <
> >>>>>> [email protected]> wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I have a small script:
> >>>>>>>
> >>>>>>> $NUTCH_PATH inject crawl/crawldb seed.txt
> >>>>>>> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> >>>>>>>
> >>>>>>> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >>>>>>> $NUTCH_PATH fetch $s1
> >>>>>>> $NUTCH_PATH parse $s1
> >>>>>>> $NUTCH_PATH updatedb crawl/crawldb $s1
> >>>>>>>
> >>>>>>> In seed.txt I have just one site, for example "test.com". When I start the
> >>>>>>> script, it fails in the fetch phase.
> >>>>>>> If I change test.com to www.test.com, it works fine. It seems the reason is
> >>>>>>> that the outgoing links on test.com all have the www. prefix.
> >>>>>>> What do I need to change in the Nutch config to work with test.com?
> >>>>>>>
> >>>>>>> Thank you in advance. I hope my explanation is clear :)
> >>>>>>>
> >>>>>>> --
> >>>>>>> Alexei A. Korolev
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Lewis
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Alexei A. Korolev
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Alexei A. Korolev
> >>
> >>
> >
> >

--
Alexei A. Korolev
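The "Segment already fetched!" / "Segment already parsed!" errors in the second run above happen because generate selected 0 records, yet the script still re-ran fetch and parse on the previous segment. A minimal sketch of a hardened crawl.sh, assuming bin/nutch is the launcher and that a fixed number of rounds is acceptable (the round count and the NUTCH_PATH value are illustrative, not from the thread):

  #!/bin/sh
  # Hypothetical rework of the crawl.sh fragment quoted above:
  # skip fetch/parse/updatedb whenever generate produced no new segment.
  NUTCH_PATH=bin/nutch

  $NUTCH_PATH inject crawl/crawldb seed.txt

  for round in 1 2 3; do
    # remember the newest segment before generating
    before=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
    $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
    s1=`ls -d crawl/crawldb/segments/* 2>/dev/null | tail -1`
    if [ -z "$s1" ] || [ "$s1" = "$before" ]; then
      echo "No new segment generated in round $round, stopping."
      break
    fi
    $NUTCH_PATH fetch $s1
    $NUTCH_PATH parse $s1
    $NUTCH_PATH updatedb crawl/crawldb $s1
  done

With db.ignore.external.links left at true and no host normalization, round 2 would then stop cleanly instead of failing, because the www.test.com outlinks never reach the crawldb.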

