Support for Sitemap Protocol and Canonical URLs
I'm teaching a search engine course for CS undergrads, and we'd like to make a contribution to Nutch. It appears that Nutch does not support the Sitemap Protocol (NUTCH-158): http://sitemaps.org/

So I wanted to check with you all and see if this is something you think would make a good addition. Also, do you think this would be a good project for a team of 3 undergrad students who need to complete it within 2-3 weeks? Being only modestly familiar with the codebase myself, I don't want to assign a project that would be too difficult or overwhelming for undergraduates who are newbies and have only been writing Java code for a few semesters.

Also, you may have heard of the new rel=canonical attribute, which is now supported by Google, Yahoo, and Live: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

I'd like my students to modify Nutch to support this new attribute as well.

After I get some feedback, I'll submit a request to JIRA. I was wondering, though: would it be better to submit it as an issue for 0.9, 1.0, or 1.1?

Thanks,
Frank

--
Frank McCown, Ph.D.
Assistant Professor of Computer Science
Harding University
http://www.harding.edu/fmccown/
Build failed in Hudson: Nutch-trunk #727
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/727/changes

--
started
Building remotely on lucene.zones.apache.org (Solaris 10)
ERROR: svn: timed out waiting for server
svn: OPTIONS request failed on '/repos/asf/lucene/nutch/trunk'
org.tmatesoft.svn.core.SVNException: svn: timed out waiting for server
svn: OPTIONS request failed on '/repos/asf/lucene/nutch/trunk'
	at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:103)
	at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:87)
	at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:601)
	at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:257)
	at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:245)
	at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.exchangeCapabilities(DAVConnection.java:454)
	at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.open(DAVConnection.java:97)
	at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.openConnection(DAVRepository.java:664)
	at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.testConnection(DAVRepository.java:96)
	at hudson.scm.SubversionSCM$DescriptorImpl.checkRepositoryPath(SubversionSCM.java:1344)
	at hudson.scm.SubversionSCM.repositoryLocationsExist(SubversionSCM.java:1410)
	at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:382)
	at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:345)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:666)
	at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:264)
	at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:238)
	at hudson.model.Run.run(Run.java:823)
	at hudson.model.Build.run(Build.java:88)
	at hudson.model.ResourceController.execute(ResourceController.java:70)
	at hudson.model.Executor.run(Executor.java:90)
Caused by: java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:519)
	at org.tmatesoft.svn.core.internal.util.SVNSocketFactory.createPlainSocket(SVNSocketFactory.java:53)
	at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.connect(HTTPConnection.java:167)
	at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:307)
	... 17 more
Publishing Javadoc
Recording test results
Re: Support for Sitemap Protocol and Canonical URLs
Frank McCown wrote:
> I'm teaching a search engine course for CS undergrads, and we'd like to make a contribution to Nutch. It appears that Nutch does not support the Sitemap Protocol (NUTCH-158). http://sitemaps.org/

Correct.

> So I wanted to check with you all and see if this is something you think would make a good addition. Also, do you think this would be a good project for a team of 3 undergrad students who need to complete it within 2-3 weeks? Being only modestly familiar with the codebase myself, I don't want to assign a project that would be too difficult or overwhelming for undergraduates who are newbies and have only been writing Java code for a few semesters.

I think it would be a welcome addition. The question is more about whether the students are prepared to go through a few rounds of review and polishing so that the code is fit for committing.

> Also you may have heard of the new rel=canonical attribute which is now being supported by Google, Yahoo, and Live: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html I'd like my students to modify Nutch to support this new attribute as well.

This sounds like a useful addition, too.

One important note: we are in the process of re-thinking the Nutch architecture, so it's likely that after the 1.0 release is out the door we will concentrate on a heavy redesign. For this reason it would be best if this new functionality could be well separated from existing classes, e.g. in utility classes, or in an extension point that other existing Nutch classes can use.

> After I get some feedback, I'll submit a request to JIRA. I was wondering though, would it be better to submit it as an issue for 0.9, 1.0, or 1.1?

1.1. We are putting the final touches on 1.0, and new development will happen only on the trunk.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
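
To illustrate the suggestion of keeping the new functionality in a utility class that is well separated from existing Nutch classes, here is a minimal sketch of what a stand-alone sitemap reader might look like. The class name SitemapUrls and its methods are hypothetical and do not exist in Nutch; the sketch uses only the JDK's XML parser and the namespace published at sitemaps.org, and it ignores everything except the <loc> entries (no gzip handling, no lastmod/changefreq/priority, no size limits).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/**
 * Hypothetical stand-alone helper (not part of Nutch) that lists the
 * URLs found in a Sitemap Protocol document.
 */
public final class SitemapUrls {

    /** XML namespace defined by the Sitemap Protocol 0.9. */
    private static final String SITEMAP_NS =
            "http://www.sitemaps.org/schemas/sitemap/0.9";

    private SitemapUrls() {
        // Utility class: no instances.
    }

    /**
     * Returns the text of every <loc> element in the sitemap namespace.
     * Because it matches <loc> anywhere in the document, it returns page
     * URLs for a <urlset> and child sitemap URLs for a <sitemapindex>.
     */
    public static List<String> parse(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(in);

        NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            // Trim whitespace that often surrounds the URL in hand-written files.
            String url = locs.item(i).getTextContent().trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    /** Convenience overload for sitemap content already held in a string. */
    public static List<String> parse(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return parse(new ByteArrayInputStream(bytes));
    }
}
```

A caller would pass the fetched sitemap content to SitemapUrls.parse(...) and feed the returned URLs into whatever injection or scoring step is appropriate. Since the class has no dependency on Nutch types, it should be easy to move behind an extension point if the architecture changes.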