Re: what is the difference between nutch and some other opensource search engines
Broad question, broad answer: free, scalable, extensible, and open-source are a few characteristics that come to mind.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: minskv [EMAIL PROTECTED]
To: nutch-dev nutch-dev@lucene.apache.org
Sent: Wednesday, April 9, 2008 2:44:51 PM
Subject: what is the difference between nutch and some other opensource search engines

And what is the main competitive strength of Nutch?

2008-04-10
minskv
Hudson build is back to normal: Nutch-trunk #416
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/416/changes
[jira] Created: (NUTCH-627) Minimize host address lookup
Minimize host address lookup
----------------------------

                 Key: NUTCH-627
                 URL: https://issues.apache.org/jira/browse/NUTCH-627
             Project: Nutch
          Issue Type: Improvement
          Components: generator
            Reporter: Otis Gospodnetic
         Attachments: NUTCH-627.patch

The simple patch that I'm about to attach keeps track of hosts for which we have already reached the max-URLs-per-host limit, as well as hosts whose hostname-to-IP lookup has already failed. For such hosts, further DNS lookups are skipped:

- there is no point in looking up a hostname yet again if we already have the max number of URLs for that host
- there is little point in attempting to look up a hostname yet again if the previous lookup already failed

In a simple test, this saved a few hundred thousand lookups for the first case and a few hundred lookups for the second case. If nobody complains, I'll commit by the end of the week.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
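The bookkeeping described in the issue could be sketched roughly as below. This is a minimal illustration of the idea, not the actual NUTCH-627 patch; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Tracks hosts for which further DNS lookups would be pointless:
 * either the per-host URL limit was reached, or a previous
 * hostname-to-IP lookup already failed.
 */
public class HostLookupCache {
  private final Set<String> failedHosts = new HashSet<>();
  private final Map<String, Integer> urlCounts = new HashMap<>();
  private final int maxUrlsPerHost;

  public HostLookupCache(int maxUrlsPerHost) {
    this.maxUrlsPerHost = maxUrlsPerHost;
  }

  /** True if a DNS lookup for this host can be skipped. */
  public boolean shouldSkipLookup(String host) {
    if (failedHosts.contains(host)) {
      return true; // a previous lookup for this host already failed
    }
    Integer count = urlCounts.get(host);
    return count != null && count >= maxUrlsPerHost; // limit reached
  }

  /** Remember that a lookup for this host failed. */
  public void recordFailure(String host) {
    failedHosts.add(host);
  }

  /** Count one more selected URL for this host. */
  public void recordUrl(String host) {
    urlCounts.merge(host, 1, Integer::sum);
  }
}
```

The generator would then consult `shouldSkipLookup` before resolving a hostname (e.g. via `InetAddress.getByName`), avoiding the repeated lookups the issue describes.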
[jira] Commented: (NUTCH-500) Add hadoop masters configuration file into conf folder
[ https://issues.apache.org/jira/browse/NUTCH-500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587480#action_12587480 ]

Hudson commented on NUTCH-500:
------------------------------

Integrated in Nutch-trunk #416 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/416/])

Add hadoop masters configuration file into conf folder
------------------------------------------------------

                 Key: NUTCH-500
                 URL: https://issues.apache.org/jira/browse/NUTCH-500
             Project: Nutch
          Issue Type: Improvement
          Components: ndfs
    Affects Versions: 0.9.0
         Environment: Linux Fedora 7, Java 1.5
            Reporter: Emmanuel Joke
            Assignee: Dennis Kubes
            Priority: Minor
             Fix For: 1.0.0
         Attachments: NUTCH-500-1-20080331.patch

The Hadoop scripts read a configuration file named masters to know how many namenodes should be started. This file is not in the repository at the moment, so it generates some error messages (errors which are not really important) when we start the cluster. It would be a good idea to add a template file to the conf directory.
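For illustration, such a template is just a list of hostnames, one per line; the hostname below is a placeholder, not a value taken from the patch:

```
# conf/masters -- hosts that run the master daemons, one hostname per line.
# "localhost" is a template/placeholder value; replace with your own host.
localhost
```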
[jira] Updated: (NUTCH-627) Minimize host address lookup
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-627:
-----------------------------------

    Attachment: NUTCH-627.patch

Minimize host address lookup
----------------------------

                 Key: NUTCH-627
                 URL: https://issues.apache.org/jira/browse/NUTCH-627
             Project: Nutch
          Issue Type: Improvement
          Components: generator
            Reporter: Otis Gospodnetic
         Attachments: NUTCH-627.patch