Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Michael Wechner
Doug Cutting wrote: http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html Well, I think incrediBILL has a point: people might really start excluding bots from their servers if it becomes too much. What might help is that incrediBILL would offer an index

search speed

2006-06-15 Thread anton
I am using DFS. My index contains 3706249 documents. Currently, a search takes from 2 to 4 seconds (I test with a query of 3 search terms). Tomcat runs on a box with a dual Opteron 2.4 GHz CPU and 16 GB RAM. I think search is very slow now. Can we make search faster? What factors influence

RE: search speed

2006-06-15 Thread Gal Nitzan
Hi, DFS is too slow for search. What we did was extract the segments to the local FS, i.e. to the hard disk. Each machine has 2 x 300 GB HDs in RAID.
    bin/hadoop dfs -get index /nutch/index
    bin/hadoop dfs -get linkdb /nutch/linkdb
    bin/hadoop dfs -get segments /nutch/segments
When we run out

RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Gal Nitzan
In my company we changed the default, and many others probably did the same. However, we must not ignore the behavior of the irresponsible users of Nutch. For that reason, the use of the default must be blocked in code. Just my 2 cents. -Original Message- From: Michael Wechner

[jira] Assigned: (NUTCH-306) DistributedSearch.Client liveAddresses concurrency problem

2006-06-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-306?page=all ] Sami Siren reassigned NUTCH-306: Assign To: Sami Siren DistributedSearch.Client liveAddresses concurrency problem -- Key:

[jira] Resolved: (NUTCH-122) block numbers need a better random number generator

2006-06-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ] Sami Siren resolved NUTCH-122: Resolution: Invalid (this is more related to Hadoop). block numbers need a better random number generator

[jira] Closed: (NUTCH-187) Cannot start Nutch datanodes on Windows outside of a cygwin environment because of DF

2006-06-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-187?page=all ] Sami Siren closed NUTCH-187: Resolution: Won't Fix (closed as requested). Cannot start Nutch datanodes on Windows outside of a cygwin environment because of DF

RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Paul Sutter
I think that Nutch has to solve the problem: if you leave the problem to the websites, they're more likely to cut you off than they are to implement their own index storage scheme. Besides, they'd get it wrong, have stale data, etc. Maybe what is needed is brainstorming on a shared crawling

[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-15 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12416379 ] Chris A. Mattmann commented on NUTCH-258: Thanks for this patch Chris - even if it is now outdated by NUTCH-303 :-( Since Nutch no longer uses the deprecated Hadoop