Hi guys, any pointers on the following?
Your help will be highly appreciated.
Thanks
-Pravin
-Original Message-
From: Pravin Karne
Sent: Friday, March 05, 2010 12:57 PM
To: nutch-user@lucene.apache.org
Subject: Two Nutch parallel crawl with two conf folder.
Hi,
I want to run two Nutch crawls in parallel, each with its own conf folder.
Any ideas, guys?
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +
hi,
the content of my redirected URLs is empty... but they still have the other
metadata...
I have an HTTP URL that is
How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.
For the rest, why don't you create two Nutch directories and run things
totally independently?
2010/3/8, Pravin Karne pravin_ka...@persistent.co.in:
Hi guys, any pointers on the following?
On 2010-03-08 14:55, BELLINI ADAM wrote:
Hi, I've just dumped my segments and found that I have both URLs: the original
one (HTTP) with an empty content, and the REDIRECTED-TO (destination) URL
(HTTPS) with NON-EMPTY content!
But in my search I found only the HTTPS URL with an empty content!! Logically
the content of the
I'm sorry... I just checked twice... and in my index I have the original URL,
which is the HTTP one with the empty content... but it doesn't index the HTTPS
one... and I'm using the Solr index.
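One setting worth checking (an assumption on my part, not a confirmed diagnosis of your index): by default Nutch's `http.redirect.max` is 0, so the fetcher does not follow a redirect immediately but records the target for a later fetch round, which can leave the original URL with empty content. A `conf/nutch-site.xml` sketch to follow redirects within the same fetch:

```xml
<!-- conf/nutch-site.xml sketch: follow up to 3 redirects during the
     fetch itself instead of queueing the target for a later round.
     The value 3 is an arbitrary example. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```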
thx
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Hello Ted,
I ran the command 'ps -aux' and I confirmed that only 1GB was defined.
I adjusted NUTCH_HEAPSIZE to 8GB (the physical RAM) and ran it again
successfully.
Do you know which parameters need to be adjusted if not enough physical RAM is
available on the server? For example, for 2GB of RAM.
I ran
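For a 2 GB box, a conservative sketch (the 1500 MB figure is my guess, not a tested recommendation): `bin/nutch` reads NUTCH_HEAPSIZE in megabytes, so set it below physical RAM to leave headroom for the OS.

```shell
# Hypothetical sizing for a 2 GB server: NUTCH_HEAPSIZE is read by
# bin/nutch in megabytes (default 1000), so stay below physical RAM.
export NUTCH_HEAPSIZE=1500
echo "NUTCH_HEAPSIZE=${NUTCH_HEAPSIZE}"
```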
Can we share a Hadoop cluster between two Nutch instances?
So there would be two Nutch instances, both pointing to the same Hadoop cluster.
This way I am able to share my hardware bandwidth. I know that Hadoop in
distributed mode serializes jobs,
but that will not affect my flow. I just want to share
Yes, it should work. I personally run some test crawls on the same
hardware, even in the same Nutch directory, thus sharing the conf
directory.
But if you don't want that, I would use two Nutch directories and of
course two different crawl directories, because with Hadoop they will
end up on the same
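The two-directory setup suggested above can be sketched as follows (all paths, seed lists, and crawl arguments are hypothetical, and the commands are echoed rather than run, since they assume a working Nutch 1.x install):

```shell
# Two independent Nutch homes sharing one Hadoop cluster; each has its
# own conf/ and its own crawl output dir, so the serialized Hadoop jobs
# never collide on paths. All paths below are hypothetical examples.
NUTCH_A=/opt/nutch-a
NUTCH_B=/opt/nutch-b

CMD_A="$NUTCH_A/bin/nutch crawl urls-a -dir crawl-a -depth 3"
CMD_B="$NUTCH_B/bin/nutch crawl urls-b -dir crawl-b -depth 3"

# Shown as echoes; run the commands for real once both installs exist.
echo "$CMD_A"
echo "$CMD_B"
```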