Re: updatedb is talking long long time

2009-11-06 Thread Kalaimathan Mahenthiran
Hi, I have tried your suggestion of lowering db.max.outlinks.per.page to a smaller number. I could not reparse the segment, as the segment was already parsed... I tried modifying some other variables such as java_heap memory and mapreduce_child_opts values... Modifying these values triggered some

Re: updatedb is talking long long time

2009-11-03 Thread Julien Nioche
OK. What heap size did you specify for this job? Could it be that you are running out of RAM and GCing a lot? Still, it should not take THAT long. Can you see some variations in the stack traces, or are they always pointing at the same things? The operations on the metadata take an awful lot of time,

Re: updatedb is talking long long time

2009-11-03 Thread Kalaimathan Mahenthiran
I can see that it's running out of RAM because... before starting the updatedb process I have approximately 7.7 GB left on the system, and once it has been running for some time.. the RAM comes down to ~48 bytes... It's definitely clogging all the RAM space... I specified the heap size to be 9 GB.. in

Re: updatedb is talking long long time

2009-11-03 Thread Julien Nioche
OK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I am pretty sure that you are running out of memory because of the inlinks which are stored in RAM. Applying the patch NUTCH-702 would also help. I have modified the CrawlDBReducer and added another parameter *db

Re: updatedb is talking long long time

2009-11-02 Thread Andrzej Bialecki
Kalaimathan Mahenthiran wrote: I forgot to add the detail... The segment I'm trying to run updatedb on has 1.3 million URLs fetched and 1.08 million URLs parsed.. Any help related to this would be appreciated... On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran matha...@gmail.com

Re: updatedb is talking long long time

2009-11-02 Thread Kalaimathan Mahenthiran
Thanks for all the replies... Okay, I think there seems to be some issue too... I'm running Nutch out of the box.. using Nutch release 1.0... I'm running this in local mode.. The number of reduce tasks.. is the default configured by Nutch... The db size is approximately 860 MB.. I know the

Re: updatedb is talking long long time

2009-11-02 Thread Julien Nioche
Hi again. I know the process is not stuck.. and the process is running, because I turned on the Hadoop logs and I can see logs being written to it... I'm not sure how to check if the task is completely stuck or not... Run jps to identify the process id, then *jstack id* several times to see if
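
The jps / jstack advice above can be sketched in a few lines of Python. This is only an illustration, not something posted in the thread: the helper names (first_frames, count_top_frames, sample_jstack) are hypothetical, and the parsing assumes the usual jstack text layout, where each thread starts with a quoted name and its frames follow on lines beginning with "at". If one frame dominates the counts across samples, the job is probably spending its time there rather than being stuck.

```python
import subprocess
import time
from collections import Counter


def first_frames(dump):
    """Map each thread name in one jstack dump to its topmost stack frame."""
    frames = {}
    current = None
    for line in dump.splitlines():
        stripped = line.strip()
        if line.startswith('"'):
            # Thread header line, e.g. "main" prio=10 tid=... runnable
            current = line.split('"')[1]
        elif stripped.startswith("at ") and current is not None:
            # Keep only the first "at ..." line seen for this thread.
            frames.setdefault(current, stripped[3:])
    return frames


def count_top_frames(dumps):
    """Count how often each (thread, top frame) pair appears across dumps."""
    counts = Counter()
    for dump in dumps:
        for thread, frame in first_frames(dump).items():
            counts[(thread, frame)] += 1
    return counts


def sample_jstack(pid, samples=5, interval=2.0):
    """Run `jstack <pid>` several times (pid as reported by jps)."""
    dumps = []
    for _ in range(samples):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        dumps.append(result.stdout)
        time.sleep(interval)
    return dumps
```

Usage would be something like count_top_frames(sample_jstack(pid)).most_common(5), with pid taken from the jps output for the Nutch process.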

Re: updatedb is talking long long time

2009-11-02 Thread Kalaimathan Mahenthiran
I have a lot of space left on /tmp. I don't have a separate partition for /tmp... I have a folder called /tmp... There is a lot of space left.. close to 1.3 terabytes... 1.4T 55G 1.3T 5% / tmpfs 3.8G 0 3.8G 0% /lib/init/rw varrun 3.8G

updatedb is talking long long time

2009-11-01 Thread Kalaimathan Mahenthiran
Hi everyone, I'm using Nutch 1.0. I have fetched successfully and am currently on the updatedb process. I'm doing updatedb and it's taking so long. I don't know why it's taking this long. I have a new machine with a quad-core processor and 8 GB of RAM. I believe this system is really good in terms of

Re: updatedb is talking long long time

2009-11-01 Thread Kalaimathan Mahenthiran
I forgot to add the detail... The segment I'm trying to run updatedb on has 1.3 million URLs fetched and 1.08 million URLs parsed.. Any help related to this would be appreciated... On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran matha...@gmail.com wrote: hi everyone I'm using nutch