Re: DiskChecker$DiskErrorException

2013-03-04 Thread Alexei Korolev
process in /tmp folder. see Thu, 07 Feb, 14:12 http://mail-archives.apache.org/mod_mbox/nutch-user/201302.mbox/%3c3e2fc3ad-f049-4091-9ebf-9e624fb18...@ucimail3.uci.cu%3E - Original message - From: Alexei Korolev alexei.koro...@gmail.com To: user@nutch.apache.org Sent: Monday

Re: DiskChecker$DiskErrorException

2013-02-11 Thread Alexei Korolev
Hi, Yes
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/md2      1065281580 592273404 419321144  59% /
udev             8177228         8   8177220   1% /dev
tmpfs            3274592       328   3274264   1% /run
none                5120         0

Re: DiskChecker$DiskErrorException

2013-02-11 Thread Alexei Korolev
this thread http://mail-archives.apache.org/mod_mbox/nutch-user/201302.mbox/browser - Original message - From: Alexei Korolev alexei.koro...@gmail.com To: user@nutch.apache.org Sent: Monday, 11 February 2013 3:40:06 Subject: Re: DiskChecker$DiskErrorException Hi, Yes Filesystem

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Fetcher: threads: 10 Fetcher: time-out divisor: 2 QueueFeeder finished: total 1 records + hit by time limit :0 Using queue mode : byHost fetching http://www.mobile365.ru/test.html # got it On 08/07/2012 04:37 PM, Alexei Korolev wrote: Hi, I made simple example Put in seed.txt http

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
You can use the HostURLNormalizer for this task or just crawl the www OR the non-www, not both. I'm trying to crawl only the version without www. As I see it, I can remove www. using a properly configured regex-normalize.xml. But will it work if mobile365.ru redirects to www.mobile365.ru (it's very common
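
To make the "remove www." idea concrete, here is a minimal Python sketch of the rewrite being discussed. It is not Nutch's regex-normalize.xml or HostURLNormalizer; the function name `strip_www` is made up for illustration, and it only shows what a www-stripping normalization does to a URL. It does not settle the question of what happens when the server itself redirects back to www.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return the URL with a leading 'www.' removed from the host.

    Hypothetical helper for illustration only. Any user:password part
    of the URL is dropped, and the host is lowercased by urlsplit.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if the original URL had one.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(strip_www("http://www.mobile365.ru/test.html"))
```

With this rewrite both `http://www.mobile365.ru/test.html` and `http://mobile365.ru/test.html` map to the same string, which is the point of normalizing before URLs are stored.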

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
So I see just one solution for crawling a limited number of sites that behave like mobile365: limit the scope of sites using regex-urlfilter.txt with a list like this +^www.mobile365.ru +^mobile365.ru Thanks. On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma markus.jel...@openindex.io wrote: If
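
A rough Python sketch of how a rule list in the style of regex-urlfilter.txt could be evaluated, assuming '+' accepts, '-' rejects, the first matching rule decides, and patterns are tested against the full URL including the scheme. The `RULES` list and the function names are invented for this example and are not taken from Nutch. Under those assumptions the anchored patterns from the message would need the `http://` prefix to match anything.

```python
import re

# Hypothetical rules: accept both host variants, reject everything else.
RULES = [
    r"+^https?://www\.mobile365\.ru/",
    r"+^https?://mobile365\.ru/",
    r"-.",
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (allow, pattern) pairs."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accepts(url, compiled):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in compiled:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = compile_rules(RULES)
    for u in ("http://www.mobile365.ru/test.html",
              "http://mobile365.ru/",
              "http://example.com/"):
        print(u, accepts(u, rules))
```

In this sketch a rule written as `+^www.mobile365.ru` never matches `http://www.mobile365.ru/`, because `^` anchors at the scheme rather than at the host.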

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Ok. Thanks a lot. I'll try later :) On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Alexei, So I see just one solution for crawling a limited number of sites that behave like mobile365: limit the scope of sites using regex-urlfilter.txt with a list

Re: crawling site without www

2012-08-07 Thread Alexei Korolev
Korolev alexei.koro...@gmail.com wrote: yes On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: http:// ? hth On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev alexei.koro...@gmail.com wrote: Hello, I have small script $NUTCH_PATH

Re: crawling site without www

2012-08-07 Thread Alexei Korolev
:02 , Alexei Korolev alexei.koro...@gmail.com wrote: Hello, Yes, test.com and www.test.com exist. test.com does not redirect to www.test.com; it opens a page with outgoing links that use www., like www.test.com/page1 www.test.com/page2 First launch of crawler script root@Ubuntu-1110