Support for Sitemap Protocol and Canonical URLs

2009-02-16 Thread Frank McCown
I'm teaching a search engine course for CS undergrads, and we'd like
to make a contribution to Nutch.  It appears that Nutch does not
support the Sitemap Protocol (NUTCH-158).

http://sitemaps.org/

So I wanted to check with you all and see if this is something you
think would make a good addition.  Also, do you think this would be a
good project for a team of 3 undergrad students who need to complete
it within 2-3 weeks?  Being only modestly familiar with the codebase
myself, I don't want to assign a project that would be too difficult
or overwhelming for undergraduates who are newbies and have only been
writing Java code for a few semesters.

Also you may have heard of the new rel=canonical attribute which is
now being supported by Google, Yahoo, and Live:

http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

I'd like my students to modify Nutch to support this new attribute as well.
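
For illustration only, here is a minimal sketch of what picking up a rel=canonical link from a fetched page might look like. The class name CanonicalLinkFinder and its method are hypothetical and are not part of Nutch; the sketch uses a regular expression over the raw HTML purely to stay self-contained, whereas a real implementation would presumably work on the DOM produced by whatever HTML parser the crawler already uses. It also assumes the rel value is exactly "canonical" and that the attribute values are quoted.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): finds a page's rel=canonical URL. */
final class CanonicalLinkFinder {

    // Matches a whole <link ...> tag; its attributes are inspected separately.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches name="value" or name='value' pairs (unquoted values are ignored).
    private static final Pattern ATTR =
            Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    private CanonicalLinkFinder() {}

    /**
     * Returns the canonical URL declared in the HTML, resolved against the
     * page's own URL, or an empty Optional if no such link is present.
     * URI.create throws IllegalArgumentException for a malformed URL.
     */
    static Optional<String> findCanonical(String html, String pageUrl) {
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String rel = null;
            String href = null;
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                String name = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (name.equals("rel")) {
                    rel = value;
                } else if (name.equals("href")) {
                    href = value;
                }
            }
            if (rel != null && href != null && rel.trim().equalsIgnoreCase("canonical")) {
                // Relative hrefs are resolved against the URL of the page itself.
                return Optional.of(URI.create(pageUrl).resolve(href.trim()).toString());
            }
        }
        return Optional.empty();
    }
}
```

Calling CanonicalLinkFinder.findCanonical(html, pageUrl) on a page whose head contains a link element with rel="canonical" yields the absolute form of its href; what the crawler then does with that URL (for example, treating the page as a duplicate of the canonical one) is a separate design decision.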

After I get some feedback, I'll submit a request to JIRA.  I was
wondering though, would it be better to submit it as an issue for 0.9,
1.0, or 1.1?

Thanks,
Frank

-- 
Frank McCown, Ph.D.
Assistant Professor of Computer Science
Harding University
http://www.harding.edu/fmccown/


Build failed in Hudson: Nutch-trunk #727

2009-02-16 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/727/changes

--
started
Building remotely on lucene.zones.apache.org (Solaris 10)
ERROR: svn: timed out waiting for server
svn: OPTIONS request failed on '/repos/asf/lucene/nutch/trunk'
org.tmatesoft.svn.core.SVNException: svn: timed out waiting for server
svn: OPTIONS request failed on '/repos/asf/lucene/nutch/trunk'
at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:103)
at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:87)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:601)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:257)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:245)
at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.exchangeCapabilities(DAVConnection.java:454)
at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.open(DAVConnection.java:97)
at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.openConnection(DAVRepository.java:664)
at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.testConnection(DAVRepository.java:96)
at hudson.scm.SubversionSCM$DescriptorImpl.checkRepositoryPath(SubversionSCM.java:1344)
at hudson.scm.SubversionSCM.repositoryLocationsExist(SubversionSCM.java:1410)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:382)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:345)
at hudson.model.AbstractProject.checkout(AbstractProject.java:666)
at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:264)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:238)
at hudson.model.Run.run(Run.java:823)
at hudson.model.Build.run(Build.java:88)
at hudson.model.ResourceController.execute(ResourceController.java:70)
at hudson.model.Executor.run(Executor.java:90)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:519)
at org.tmatesoft.svn.core.internal.util.SVNSocketFactory.createPlainSocket(SVNSocketFactory.java:53)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.connect(HTTPConnection.java:167)
at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:307)
... 17 more
Publishing Javadoc
Recording test results



Re: Support for Sitemap Protocol and Canonical URLs

2009-02-16 Thread Andrzej Bialecki

Frank McCown wrote:

> I'm teaching a search engine course for CS undergrads, and we'd like
> to make a contribution to Nutch.  It appears that Nutch does not
> support the Sitemap Protocol (NUTCH-158).
>
> http://sitemaps.org/


Correct.



> So I wanted to check with you all and see if this is something you
> think would make a good addition.  Also, do you think this would be a
> good project for a team of 3 undergrad students who need to complete
> it within 2-3 weeks?  Being only modestly familiar with the codebase
> myself, I don't want to assign a project that would be too difficult
> or overwhelming for undergraduates who are newbies and have only been
> writing Java code for a few semesters.


I think it would be a welcome addition. The question is more about 
whether the students are prepared to go through a few rounds of review 
and polishing the code so that it's fit for committing.



> Also you may have heard of the new rel=canonical attribute which is
> now being supported by Google, Yahoo, and Live:
>
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
>
> I'd like my students to modify Nutch to support this new attribute as well.


This sounds like a useful addition, too.

One important note: we are in the process of re-thinking the Nutch 
architecture, so it's likely that once the 1.0 release is out the door 
we will concentrate on a heavy redesign.


For this reason it would be best if this new functionality could be well 
separated from existing classes, e.g. in utility classes, or in an 
extension point that other existing Nutch classes can use.
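
To make the "utility class" suggestion concrete, here is a rough sketch of a sitemap parser that depends only on the JDK, so it could survive a redesign of the surrounding crawler code. The class name SitemapUrlExtractor and its method are hypothetical and do not exist in Nutch; the sketch assumes the sitemap has already been fetched (and decompressed, if gzipped) into a string, and it only collects loc values. Fields such as lastmod, changefreq and priority, and the distinction between a urlset and a sitemapindex, are left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical standalone helper (not a Nutch class): lists URLs in a sitemap. */
final class SitemapUrlExtractor {

    // XML namespace defined by the Sitemap Protocol 0.9 schema.
    private static final String SITEMAP_NS =
            "http://www.sitemaps.org/schemas/sitemap/0.9";

    private SitemapUrlExtractor() {}

    /**
     * Returns the text of every loc element in the sitemap namespace.
     * Note that this also picks up loc entries of a sitemap index file.
     * Throws IllegalArgumentException if the input is not parsable XML.
     */
    static List<String> extractUrls(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Sitemaps come from untrusted sites, so refuse DOCTYPE declarations.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(sitemapXml)));

            NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                String url = locs.item(i).getTextContent().trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a parsable sitemap", e);
        }
    }
}
```

Because the class takes a string and returns a plain list, existing crawler code (or a future extension point) could call SitemapUrlExtractor.extractUrls(xml) and feed the result into whatever URL injection mechanism it uses, without the parser knowing anything about that mechanism.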





> After I get some feedback, I'll submit a request to JIRA.  I was
> wondering though, would it be better to submit it as an issue for 0.9,
> 1.0, or 1.1?


1.1. We are putting the final touches on 1.0, and new development will 
happen only on the trunk.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com