process in /tmp
folder.
see Thu, 07 Feb, 14:12
http://mail-archives.apache.org/mod_mbox/nutch-user/201302.mbox/%3c3e2fc3ad-f049-4091-9ebf-9e624fb18...@ucimail3.uci.cu%3E
- Original message -
From: Alexei Korolev alexei.koro...@gmail.com
To: user@nutch.apache.org
Sent: Monday
Hi,
Yes
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/md2       1065281580  592273404  419321144  59% /
udev              8177228          8    8177220   1% /dev
tmpfs             3274592        328    3274264   1% /run
none                 5120          0
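The df listing above can be complemented by checking the working directory itself. The following is a minimal sketch, assuming the DiskChecker$DiskErrorException relates to the local working directory under /tmp, as this thread suggests. The function name `check_local_dir` and the 1 GiB threshold are hypothetical choices for illustration and are not part of Nutch or Hadoop.

```python
import os
import shutil
import tempfile


def check_local_dir(path, min_free_bytes=1 << 30):
    """Return a list of problems found for `path`; an empty list means it looks usable.

    `min_free_bytes` is an arbitrary illustrative threshold (1 GiB by default).
    """
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]

    problems = []
    # A directory must be both writable and searchable to create files in it.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")

    # shutil.disk_usage reports total, used and free bytes for the filesystem
    # that contains `path`.
    usage = shutil.disk_usage(path)
    if usage.free < min_free_bytes:
        problems.append(
            f"{path} has only {usage.free} bytes free, wanted {min_free_bytes}"
        )
    return problems


if __name__ == "__main__":
    # Check the system temporary directory (usually /tmp on Linux).
    print(check_local_dir(tempfile.gettempdir()))
```

Run it as the same user that runs the crawl, since permissions and quotas can differ between users.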
this thread
http://mail-archives.apache.org/mod_mbox/nutch-user/201302.mbox/browser
- Original message -
From: Alexei Korolev alexei.koro...@gmail.com
To: user@nutch.apache.org
Sent: Monday, 11 February 2013 3:40:06
Subject: Re: DiskChecker$DiskErrorException
Hi,
Yes
Filesystem
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching http://www.mobile365.ru/test.html
# got it
On 08/07/2012 04:37 PM, Alexei Korolev wrote:
Hi,
I made a simple example
Put in seed.txt
http
You can use the HostURLNormalizer for this task or just crawl the www OR
the non-www, not both.
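To illustrate the idea of mapping the www and non-www variants onto a single host, here is a minimal Python sketch. It is not the HostURLNormalizer plugin and does not use any Nutch API. The function name `strip_www` is hypothetical, and the sketch assumes URLs carry no user:password part, which it would drop.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return `url` with a leading "www." removed from the host name."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lower-cased by urlsplit
    if host.startswith("www."):
        host = host[len("www."):]
    # Re-attach an explicit port if the original URL had one.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    for u in ("http://www.mobile365.ru/test.html", "http://mobile365.ru/test.html"):
        print(u, "->", strip_www(u))
```

Both inputs map to the same non-www URL, so a crawler that applies such a rule consistently would treat them as one page.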
I'm trying to crawl only the version without www. As I see it, I can remove www.
using a properly configured regex-normalize.xml.
But will it work if mobile365.ru redirects to www.mobile365.ru (it's very
common
So I see just one solution for crawling a limited number of sites that
behave like mobile365: limit the scope of sites using
regex-urlfilter.txt with a list like this
+^www.mobile365.ru
+^mobile365.ru
Thanks.
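The effect of a two-host filter list like the one above can be sketched in Python. This is a simulation, not Nutch code. It assumes that rules are tried in order, that the first matching rule decides, that a URL matching no rule is rejected, and that patterns are matched against the full URL including the scheme. Under that last assumption a pattern starting with `^www.` would never match, so the sketch includes `http://` in the pattern and escapes the dots.

```python
import re

# Hypothetical rule list: accept both host variants, reject everything else.
RULES = [
    ("+", re.compile(r"^https?://(www\.)?mobile365\.ru/")),
    ("-", re.compile(r".")),
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule, else False."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched


if __name__ == "__main__":
    for u in (
        "http://www.mobile365.ru/test.html",
        "http://mobile365.ru/",
        "http://example.com/",
    ):
        print(u, accept(u))
    # A pattern without the scheme does not match a full URL:
    print(re.search(r"^www.mobile365.ru", "http://www.mobile365.ru/") is None)
```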
On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma markus.jel...@openindex.io wrote:
If
Ok. Thank you a lot. I'll try later :)
On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
Hi Alexei,
So I see just one solution for crawling a limited number of sites that
behave like mobile365: limit the scope of sites using
regex-urlfilter.txt with a list
Korolev alexei.koro...@gmail.com
wrote:
yes
On Sat, Aug 4, 2012 at 6:11 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
http:// ?
hth
On Fri, Aug 3, 2012 at 9:53 AM, Alexei Korolev
alexei.koro...@gmail.com
wrote:
Hello,
I have a small script
$NUTCH_PATH
:02 , Alexei Korolev alexei.koro...@gmail.com
wrote:
Hello,
Yes, test.com and www.test.com exist.
test.com does not redirect to www.test.com; it opens a page with outgoing
links that use www., like www.test.com/page1 and www.test.com/page2
First launch of crawler script
root@Ubuntu-1110