Hi
I have tried your suggestion of lowering db.max.outlinks.per.page to a
smaller number. I could not reparse the segment, as the segment was
already parsed... I tried modifying some other variables, such as the
Java heap memory and the mapred.child.java.opts values... modifying these
values triggered some
OK. What heap size did you specify for this job? Could it be that you are
running out of RAM and GCing a lot? Still, it should not take THAT long.
Can you see some variation in the stack traces, or are they always pointing
at the same things?
The operations on the metadata take an awful lot of time,
I can see that it's running out of RAM because... before starting the
updatedb process I have approximately 7.7 GB free on the system, and as
soon as this has been running for some time.. the free RAM comes down to
~48 bytes...
it's definitely clogging up all the RAM...
I specified the heap size to be 9 GB.. in
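For reference, the knobs involved here are presumably the ones below; the
values are only illustrative, not what was actually set. A heap larger than
physical RAM (e.g. 9 GB on an 8 GB box) will push the JVM into swap and
constant GC, so keeping it well below 8 GB would be the first thing to try:

  # read by bin/nutch if set; value is in MB
  export NUTCH_HEAPSIZE=4000

  <!-- per-task heap for the MapReduce children, e.g. in conf/hadoop-site.xml -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>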
OK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I am
pretty sure that you are running out of memory because of the inlinks which
are stored in RAM.
Applying the patch NUTCH-702 would also help.
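Something along these lines in conf/nutch-site.xml should do it; 50 is just
an example value, pick whatever you can live with, and reparse so the
smaller outlink lists actually make it into the segment:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>50</value>
    <description>Maximum number of outlinks kept per page at parse time.</description>
  </property>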
I have modified the CrawlDbReducer and added another parameter *db
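To be clear, what follows is not the actual patch or the real CrawlDbReducer
logic, just a rough sketch of the idea of capping the inlink records handled
per URL; the property name db.update.max.inlinks below is only a stand-in
for whatever the added parameter really is, and the real reducer merges the
records instead of passing them through:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.nutch.crawl.CrawlDatum;

  /** Sketch only: drop excess STATUS_LINKED records instead of buffering them all. */
  public class CappedInlinkReducer extends MapReduceBase
      implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

    private int maxLinked;

    public void configure(JobConf job) {
      // stand-in property name, not necessarily what the real change uses
      maxLinked = job.getInt("db.update.max.inlinks", 10000);
    }

    public void reduce(Text url, Iterator<CrawlDatum> values,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      int linked = 0;
      while (values.hasNext()) {
        CrawlDatum datum = values.next();
        if (datum.getStatus() == CrawlDatum.STATUS_LINKED
            && ++linked > maxLinked) {
          continue; // ignore the remaining inlinks for this URL
        }
        // the real CrawlDbReducer merges these into one updated CrawlDatum;
        // they are passed through here only to keep the sketch short
        output.collect(url, datum);
      }
    }
  }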
Kalaimathan Mahenthiran wrote:
I forgot to add this detail...
The segment I'm trying to run updatedb on has 1.3 million URLs fetched
and 1.08 million URLs parsed..
Any help related to this would be appreciated...
On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
matha...@gmail.com wrote:
Thanks for all the replies...
Okay, I think there seems to be some issue too...
I'm running Nutch out of the box.. using Nutch release 1.0... I'm
running this in local mode..
The number of reduce tasks.. is the default configured by Nutch...
The db size is approximately 860 MB..
Hi again
I know the process is not stuck.. the process is running, because I
turned on the Hadoop logs and I can see log entries being written to them...
I'm not sure how to check whether the task is completely stuck or not...
Run jps to identify the process id, then run *jstack id* several times to see if
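In concrete terms, something like:

  $ jps                          # lists the local JVMs and their PIDs
  $ jstack <pid> > stack1.txt    # take a dump; repeat a minute or two apart
  $ jstack <pid> > stack2.txt

If the same frames show up near the top of every dump, that is where the
time is going; constantly changing stacks suggest the job is making
progress, just slowly.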
I have a lot of space left on /tmp. I don't have a separate partition
for /tmp... I have a folder called /tmp... There is a lot of space
left.. close to 1.3 terabytes...
Filesystem    Size  Used Avail Use% Mounted on
              1.4T   55G  1.3T   5% /
tmpfs         3.8G     0  3.8G   0% /lib/init/rw
varrun        3.8G
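For what it's worth, in local mode the job's intermediate data ends up under
hadoop.tmp.dir, which defaults to a directory under /tmp, so the free space
there is indeed what matters. It can be pointed elsewhere in the config if
needed (the path below is just an example):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp</value>
  </property>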
Hi everyone,
I'm using Nutch 1.0. I have fetched successfully and am currently on
the updatedb step. I'm running updatedb and it's taking very long. I don't
know why it's taking this long. I have a new machine with a quad-core
processor and 8 GB of RAM.
I believe this system is really good in terms of
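For context, the command in question is along these lines (the paths are
placeholders for my actual crawldb and segment):

  bin/nutch updatedb crawl/crawldb crawl/segments/20091101123456

i.e. merging the newly fetched and parsed segment back into the crawldb.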