Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Hi Sebastian, It seems you are right: I have db.ignore.external.links set to true. But how do I configure Nutch to process mobile365.ru and www.mobile365.ru as a single site? Thanks. On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Alexei, I tried a crawl with

RE: crawling site without www

2012-08-08 Thread Markus Jelsma
-Original message- From: Alexei Korolev alexei.koro...@gmail.com Sent: Wed 08-Aug-2012 15:43 To: user@nutch.apache.org Subject: Re: crawling site without www Hi Sebastian, It seems you are right: I have db.ignore.external.links set to true. But how do I configure Nutch

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
You can use the HostURLNormalizer for this task or just crawl the www OR the non-www, not both. I'm trying to crawl only the version without www. As I see it, I can remove the www. prefix using a properly configured regex-normalize.xml. But will it work if mobile365.ru redirects to www.mobile365.ru (it's very common
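A minimal sketch of the kind of rewrite being discussed here, written as a shell function rather than as a regex-normalize.xml rule. The function name strip_www and the pattern are illustrative assumptions; they are not taken from the thread or from Nutch's shipped configuration.

```shell
# Hypothetical helper: drop a leading "www." from the host part of an
# http(s) URL. URLs without that prefix are printed unchanged.
strip_www() {
  printf '%s\n' "$1" | sed -E 's|^(https?://)www\.|\1|'
}

strip_www "http://www.mobile365.ru/test.html"
```

This only rewrites URL strings. It says nothing about what the crawler does when the server itself redirects to the www host, which is the open question in the message above.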

RE: crawling site without www

2012-08-08 Thread Markus Jelsma
15:55 To: user@nutch.apache.org Subject: Re: crawling site without www You can use the HostURLNormalizer for this task or just crawl the www OR the non-www, not both. I'm trying to crawl only the version without www. As I see it, I can remove the www. prefix using a properly configured regex-normalize.xml

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
To: user@nutch.apache.org Subject: Re: crawling site without www You can use the HostURLNormalizer for this task or just crawl the www OR the non-www, not both. I'm trying to crawl only the version without www. As I see it, I can remove the www. prefix using a properly configured regex

Re: crawling site without www

2012-08-08 Thread Sebastian Nagel
URLs to the host that is being redirected to. -Original message- From: Alexei Korolev alexei.koro...@gmail.com Sent: Wed 08-Aug-2012 15:55 To: user@nutch.apache.org Subject: Re: crawling site without www You can use the HostURLNormalizer for this task or just crawl the www OR the non

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
:55 To: user@nutch.apache.org Subject: Re: crawling site without www You can use the HostURLNormalizer for this task or just crawl the www OR the non-www, not both. I'm trying to crawl only the version without www. As I see it, I can remove the www. prefix using a properly configured regex

Re: crawling site without www

2012-08-07 Thread Alexei Korolev
Hello, Yes, test.com and www.test.com exist. test.com does not redirect to www.test.com; it opens a page with outgoing links that use www., like www.test.com/page1 www.test.com/page2 First launch of the crawler script: root@Ubuntu-1110-oneiric-64-minimal:/data/nutch# ./crawl.sh Injector: starting at 2012-08-07

Re: crawling site without www

2012-08-07 Thread Mathijs Homminga
Hi, I read from your logs: - test.com is injected. - test.com is fetched and parsed successfully. - but when you run a generate again (second launch), no segment is created (because no url is selected) and your script tries to fetch and parse the first segment again. Hence the errors. So
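The failure mode described above (generate creates no new segment, so the script picks up the old one again) could be guarded against along these lines. This is a sketch under assumptions: the helper names latest_segment and segment_is_new are hypothetical, and it relies on segment directory names sorting in creation order, as timestamp-style names do.

```shell
# Hypothetical helpers; the names are illustrative and not part of Nutch.

# Print the last directory (in sorted glob order) under the given
# segments directory, or nothing if there is none.
latest_segment() {
  last=""
  for d in "$1"/*; do
    if [ -d "$d" ]; then last=$d; fi
  done
  printf '%s' "$last"
}

# Succeed only if the "after" segment is non-empty and differs from "before".
segment_is_new() {
  [ -n "$2" ] && [ "$1" != "$2" ]
}

# Usage sketch around the generate step of the script in this thread:
#   before=$(latest_segment crawl/crawldb/segments)
#   $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0
#   after=$(latest_segment crawl/crawldb/segments)
#   segment_is_new "$before" "$after" || { echo "no new segment"; exit 0; }
```

Comparing the newest segment before and after generate lets the script stop cleanly instead of fetching and parsing a segment that was already processed.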

Re: crawling site without www

2012-08-07 Thread Alexei Korolev
Hi, I made a simple example. Put http://mobile365.ru in seed.txt and it will produce the error. Put http://www.mobile365.ru in seed.txt and the second launch of the crawler script will work fine and fetch the http://www.mobile365.ru/test.html page. On Tue, Aug 7, 2012 at 6:23 PM, Mathijs Homminga

Re: crawling site without www

2012-08-07 Thread Sebastian Nagel
Hi Alexei, I tried a crawl with your script fragment and Nutch 1.5.1 and the URL http://mobile365.ru as seed. It worked, see annotated log below. Which version of Nutch do you use? Check the property db.ignore.external.links (default is false). If true the link from mobile365.ru to
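To illustrate why that property matters for this site: when hosts are compared as plain strings, mobile365.ru and www.mobile365.ru are different hosts, so a link between them counts as external. The sketch below shows such a comparison with hypothetical helpers url_host and same_host; it is not Nutch's actual implementation and it ignores details such as letter case and user-info in URLs.

```shell
# Hypothetical helper: print the host part of an http(s) URL.
url_host() {
  printf '%s\n' "$1" | sed -E 's|^https?://([^/:?#]+).*|\1|'
}

# Succeed only if both URLs have exactly the same host string.
same_host() {
  [ "$(url_host "$1")" = "$(url_host "$2")" ]
}

if same_host "http://mobile365.ru" "http://www.mobile365.ru/test.html"; then
  echo "internal link"
else
  echo "external link"
fi
```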

Re: crawling site without www

2012-08-04 Thread Lewis John Mcgibbney
http:// ? hth On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote: Hello, I have small script $NUTCH_PATH inject crawl/crawldb seed.txt $NUTCH_PATH generate crawl/crawldb crawl/crawldb/segments -adddays 0 s1=`ls -d crawl/crawldb/segments/* | tail -1`

Re: crawling site without www

2012-08-04 Thread Mathijs Homminga
What exactly do you mean by "it falls on fetch phase"? Do you get an error? Does test.com exist? Does it perhaps redirect to www.test.com? ... Mathijs On Aug 4, 2012, at 17:11 , Alexei Korolev alexei.koro...@gmail.com wrote: yes On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney

Re: crawling site without www

2012-08-04 Thread Sebastian Nagel
Hi Alexei, Because users are lazy, some browsers automatically try to add the www (and other stuff) to escape from a server not found error, see http://www-archive.mozilla.org/docs/end-user/domain-guessing.html Nutch does no domain guessing. The URLs have to be correct and the host name must be