Re: crawling site without www
Hi, Sebastian. Seems you are right: I have db.ignore.external.links set to true. But how do I configure Nutch to process mobile365.ru and www.mobile365.ru as a single site?

Thanks.

On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:

> Hi Alexei,
> I tried a crawl with your script fragment and Nutch 1.5.1 and the URL
> http://mobile365.ru as seed. It worked, see annotated log below.
> Which version of Nutch do you use?
> Check the property db.ignore.external.links (default is false).
> If true the link from mobile365.ru to www.mobile365.ru is skipped.
> [...]
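For reference, the property in question lives in conf/nutch-site.xml. A minimal sketch of the setting (grounded in Sebastian's note that false is the default): with false, outlinks to other hosts, including www.mobile365.ru as seen from mobile365.ru, are kept; the trade-off is that every other external link is kept too, which is why the thread turns to normalizers and filters next.

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <!-- true would drop every outlink whose host differs from the
       page's own host, e.g. mobile365.ru -> www.mobile365.ru -->
</property>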
RE: crawling site without www
-----Original message-----
From: Alexei Korolev alexei.koro...@gmail.com
Sent: Wed 08-Aug-2012 15:43
To: user@nutch.apache.org
Subject: Re: crawling site without www

> Hi, Sebastian. Seems you are right: I have db.ignore.external.links set to
> true. But how do I configure Nutch to process mobile365.ru and
> www.mobile365.ru as a single site?

You can use the HostURLNormalizer for this task, or just crawl the www OR the non-www host, not both.

> Thanks.
>
> On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel
> wastl.na...@googlemail.com wrote:
> [...]
Re: crawling site without www
> You can use the HostURLNormalizer for this task, or just crawl the www OR
> the non-www host, not both.

I'm trying to crawl only the version without www. As I see it, I can remove the "www." with a properly configured regex-normalize.xml. But will it work if mobile365.ru redirects to www.mobile365.ru? (It's a very common situation on the web.)

Thanks.
Alexei
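A minimal sketch of such a normalizer rule for conf/regex-normalize.xml, assuming the urlnormalizer-regex plugin is enabled (the pattern is illustrative and pinned to this one site):

<regex-normalize>
  <!-- collapse www.mobile365.ru onto mobile365.ru by stripping the
       "www." host prefix; $1 preserves the scheme -->
  <regex>
    <pattern>^(https?://)www\.(mobile365\.ru)</pattern>
    <substitution>$1$2</substitution>
  </regex>
</regex-normalize>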
RE: crawling site without www
If it starts to redirect and you are on the wrong side of the redirect, you're in trouble. But with the HostNormalizer you can then renormalize all URLs to the host that is being redirected to.

-----Original message-----
From: Alexei Korolev alexei.koro...@gmail.com
Sent: Wed 08-Aug-2012 15:55
To: user@nutch.apache.org
Subject: Re: crawling site without www

> > You can use the HostURLNormalizer for this task, or just crawl the www
> > OR the non-www host, not both.
>
> I'm trying to crawl only the version without www. As I see it, I can
> remove the "www." with a properly configured regex-normalize.xml. But will
> it work if mobile365.ru redirects to www.mobile365.ru? (It's a very common
> situation on the web.)
>
> Thanks.
> Alexei
Re: crawling site without www
So I see just one solution for crawling a limited set of sites that behave like mobile365: limit the scope with regex-urlfilter.txt, using a list like this:

+^www.mobile365.ru
+^mobile365.ru

Thanks.

On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma markus.jel...@openindex.io wrote:

> If it starts to redirect and you are on the wrong side of the redirect,
> you're in trouble. But with the HostNormalizer you can then renormalize
> all URLs to the host that is being redirected to.
> [...]

--
Alexei A. Korolev
Re: crawling site without www
Hi Alexei,

> So I see just one solution for crawling a limited set of sites that behave
> like mobile365: limit the scope with regex-urlfilter.txt, using a list
> like this:
>
> +^www.mobile365.ru
> +^mobile365.ru

Better:

+^https?://(?:www\.)?mobile365\.ru/

or, to catch all of mobile365.ru:

+^https?://(?:[a-z0-9-]+\.)*mobile365\.ru/

And don't forget to remove the final rule

# accept anything else
+.

and replace it by

# skip everything else
-.

If there are more than a few hosts / domains you want to allow, the urlfilter-domain plugin would be a more comfortable choice. There a simple line has the desired effect:

mobile365.ru

Sebastian
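To illustrate the urlfilter-domain route Sebastian mentions, a sketch of the two pieces involved. The plugin.includes value is illustrative (merge urlfilter-domain into whatever plugin list your nutch-site.xml already carries), and the file name follows the plugin's usual default; check the urlfilter.domain.file property if yours differs.

<!-- conf/nutch-site.xml (sketch): add urlfilter-domain to the plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

# conf/domain-urlfilter.txt: one host or domain per line; a bare domain
# accepts the domain and all hosts under it, so this single line covers
# mobile365.ru and www.mobile365.ru:
mobile365.ru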
Re: crawling site without www
Ok. Thank you a lot. I'll try it later :)

On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:

> Hi Alexei,
> [...]

--
Alexei A. Korolev
Re: crawling site without www
Hello,

Yes, test.com and www.test.com exist. test.com does not redirect to www.test.com; it serves a page whose outgoing links carry the www. prefix, like:

www.test.com/page1
www.test.com/page2

First launch of the crawler script:

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:00:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:00:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:00:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807160035
Generator: finished at 2012-08-07 16:00:36, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:00:37
Fetcher: segment: crawl/crawldb/segments/20120807160035
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
fetching http://test.com
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 16:00:41, elapsed: 00:00:04
ParseSegment: starting at 2012-08-07 16:00:41
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Parsing: http://test.com
ParseSegment: finished at 2012-08-07 16:00:44, elapsed: 00:00:02
CrawlDb update: starting at 2012-08-07 16:00:44
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807160035]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 16:00:45, elapsed: 00:00:01
LinkDb: starting at 2012-08-07 16:00:46
LinkDb: linkdb: crawl/crawldb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/data/nutch/crawl/crawldb/segments/20120807160035
LinkDb: finished at 2012-08-07 16:00:47, elapsed: 00:00:01

Second launch of the script:

root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh
Injector: starting at 2012-08-07 16:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 16:01:32, elapsed: 00:00:02
Generator: starting at 2012-08-07 16:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-08-07 16:01:35
Fetcher: segment: crawl/crawldb/segments/20120807160035
Fetcher: java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:58)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
ParseSegment: starting at 2012-08-07 16:01:35
ParseSegment: segment: crawl/crawldb/segments/20120807160035
Exception in thread main java.io.IOException: Segment already parsed!
        at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:87)
        at
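The second-launch failure happens because the script goes on to fetch and parse the previous segment even though generate created no new one. A sketch of a guard for such a crawl script, assuming (as in the Nutch 1.x Generator) a non-zero exit code when no records are selected; $NUTCH_PATH is the variable from Alexei's script:

# stop the cycle when generate selects nothing, instead of re-fetching
# the old segment and hitting "Segment already fetched!"
$NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
if [ $? -ne 0 ]; then
    echo "Generator selected no records, stopping."
    exit 0
fi
s1=`ls -d crawl/crawldb/segments/* | tail -1`
$NUTCH_PATH fetch $s1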
Re: crawling site without www
Hi,

I read from your logs:
- test.com is injected.
- test.com is fetched and parsed successfully.
- but when you run a generate again (second launch), no segment is created (because no url is selected) and your script tries to fetch and parse the first segment again. Hence the errors.

So test.com is fetched successfully. The question remains: why is no url selected in the second generate? Many answers are possible. Can you tell us what urls you have in your crawldb after the first cycle? Perhaps no outlinks have been found / added.

Mathijs

On Aug 7, 2012, at 16:02 , Alexei Korolev alexei.koro...@gmail.com wrote:

> Hello,
> Yes, test.com and www.test.com exist. test.com does not redirect to
> www.test.com; it serves a page whose outgoing links carry the www. prefix,
> like www.test.com/page1, www.test.com/page2.
> [...]
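To answer Mathijs's question about the crawldb contents, the readdb tool used elsewhere in this thread can also dump full records, not just -stats. A usage sketch (the output directory name is illustrative; the part-file name depends on the Hadoop setup):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump
# the dump is plain text; check whether www.test.com outlinks made it in
less crawldb_dump/part-00000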
Re: crawling site without www
Hi,

I made a simple example. Put http://mobile365.ru in seed.txt and it will produce the error. Put http://www.mobile365.ru in seed.txt and the second launch of the crawler script will work fine and fetch the http://www.mobile365.ru/test.html page.

On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga mathijs.hommi...@kalooga.com wrote:

> Hi,
>
> I read from your logs:
> - test.com is injected.
> - test.com is fetched and parsed successfully.
> - but when you run a generate again (second launch), no segment is created
>   (because no url is selected) and your script tries to fetch and parse
>   the first segment again. Hence the errors.
> [...]
Re: crawling site without www
Hi Alexei,

I tried a crawl with your script fragment and Nutch 1.5.1 and the URL http://mobile365.ru as seed. It worked, see annotated log below. Which version of Nutch do you use?

Check the property db.ignore.external.links (default is false). If true, the link from mobile365.ru to www.mobile365.ru is skipped.
Look into your crawldb (bin/nutch readdb).
Check your URL filters with bin/nutch org.apache.nutch.net.URLFilterChecker.
Finally, send the nutch-site.xml and every configuration file you changed.

Good luck,
Sebastian

% nutch inject crawl/crawldb seed.txt
Injector: starting at 2012-08-07 20:31:00
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15

% nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
Generator: starting at 2012-08-07 20:31:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807203131
Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
# Note: personally, I would prefer not to place segments (also linkdb)
# in the crawldb/ folder.

% s1=`ls -d crawl/crawldb/segments/* | tail -1`
% nutch fetch $s1
Fetcher: starting at 2012-08-07 20:32:00
Fetcher: segment: crawl/crawldb/segments/20120807203131
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://mobile365.ru/
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07

% nutch parse $s1
ParseSegment: starting at 2012-08-07 20:32:12
ParseSegment: segment: crawl/crawldb/segments/20120807203131
Parsed (10ms):http://mobile365.ru/
ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07

% nutch updatedb crawl/crawldb/ $s1
CrawlDb update: starting at 2012-08-07 20:32:24
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13

# see whether the outlink is now in crawldb:
% nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 2
retry 0: 2
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
status 2 (db_fetched): 1
CrawlDb statistics: done
# = yes: http://mobile365.ru/ is fetched, outlink found

% nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
Generator: starting at 2012-08-07 20:32:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/crawldb/segments/20120807203307
Generator: finished at 2012-08-07 20:33:14, elapsed: 00:00:15

% s1=`ls -d crawl/crawldb/segments/* | tail -1`
% nutch fetch $s1
Fetcher: starting at 2012-08-07 20:33:34
Fetcher: segment: crawl/crawldb/segments/20120807203307
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://www.mobile365.ru/test.html
# got it

On 08/07/2012 04:37 PM, Alexei Korolev wrote:

> Hi,
> I made a simple example. Put http://mobile365.ru in seed.txt and it will
> produce the error.
> [...]
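For Sebastian's URLFilterChecker hint, a usage sketch. The -allCombined option and the +/- output convention are as remembered from Nutch 1.x; treat them as assumptions and run bin/nutch org.apache.nutch.net.URLFilterChecker without arguments for the exact usage.

% echo "http://www.mobile365.ru/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# the checker reads URLs from stdin and echoes each one back prefixed
# with '+' if all enabled filters accept it, '-' if any rejects it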
Re: crawling site without www
http:// ?

hth

On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote:

> Hello,
>
> I have a small script:
>
> $NUTCH_PATH inject crawl/crawldb seed.txt
> $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
> s1=`ls -d crawl/crawldb/segments/* | tail -1`
> $NUTCH_PATH fetch $s1
> $NUTCH_PATH parse $s1
> $NUTCH_PATH updatedb crawl/crawldb $s1
>
> In seed.txt I have just one site, for example test.com. When I start the
> script it falls on the fetch phase. If I change test.com to www.test.com
> it works fine. Seems the reason is that the outgoing links on test.com all
> have the www. prefix. What do I need to change in the nutch config to make
> it work with test.com?
>
> Thank you in advance. I hope my explanation is clear :)
>
> --
> Alexei A. Korolev

--
Lewis
Re: crawling site without www
What do you mean exactly with "it falls on the fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ...

Mathijs

On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote:

> yes
>
> On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney
> lewis.mcgibb...@gmail.com wrote:
>
> > http:// ?
> > [...]

--
Alexei A. Korolev
Re: crawling site without www
Hi Alexei,

Because users are lazy, some browsers automatically try to add the www (and other stuff) to escape from a "server not found" error, see
http://www-archive.mozilla.org/docs/end-user/domain-guessing.html

Nutch does no domain guessing. The URLs have to be correct and the host name must be complete. Finally, even if test.com sends an HTTP redirect pointing to www.test.com: check your URL filters whether both hosts are accepted.

Sebastian

On 08/04/2012 05:33 PM, Mathijs Homminga wrote:

> What do you mean exactly with "it falls on the fetch phase"? Do you get an
> error? Does test.com exist? Does it perhaps redirect to www.test.com? ...
> [...]
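If test.com really did redirect to www.test.com, one related knob is the http.redirect.max property. A sketch (the value is illustrative; the default of 0 in Nutch 1.x makes the fetcher record redirect targets for a later cycle rather than follow them at once, and those targets still have to pass the URL filters Sebastian mentions):

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <!-- a positive value lets the fetcher follow up to that many
       redirects immediately during the same fetch -->
</property>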