Web Service on Nutch
Hi All, I would like to build a web service based on Nutch. I followed the wiki tutorial on running Nutch 1.0 in Eclipse and the plugin tutorial, and I have customized both the Nutch crawl and the Nutch search; I would now like to expose these as a web service. However, when I tried to build it (with EJB 3.0), I got error messages such as java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found. Is there a tutorial that can guide me through building a web service for Nutch, or could anyone help me with this? Thanks. Warm regards, Kim
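A possible first diagnostic for the error above, offered as a sketch rather than a confirmed fix: an "x point ... not found" exception often (though not necessarily in this case) means that Nutch could not locate its plugin directory at runtime, which is easy to get wrong once the code runs inside an application server instead of the Nutch working directory. The class below is hypothetical and uses only the Java standard library; it lists the subdirectories of a candidate plugin folder that contain a plugin.xml manifest. The folder name "plugins" is an assumption based on the usual value of the plugin.folders property.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch: checks whether a directory
// looks like a populated Nutch plugin folder.
class PluginFolderCheck {

    // Returns the sorted names of subdirectories of pluginRoot that contain
    // a plugin.xml file. Returns an empty list if pluginRoot is missing,
    // unreadable, or not a directory.
    static List<String> findPlugins(File pluginRoot) {
        List<String> found = new ArrayList<>();
        File[] children = pluginRoot.listFiles();
        if (children == null) {
            return found;
        }
        for (File child : children) {
            if (child.isDirectory() && new File(child, "plugin.xml").isFile()) {
                found.add(child.getName());
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) {
        // "plugins" is an assumed default; pass the real path as the first argument.
        File root = new File(args.length > 0 ? args[0] : "plugins");
        List<String> plugins = findPlugins(root);
        System.out.println(root.getAbsolutePath() + ": " + plugins.size() + " plugin(s) found");
        for (String name : plugins) {
            System.out.println("  " + name);
        }
    }
}
```

If this reports zero plugins when run from the web service's working directory, pointing plugin.folders at an absolute path in the deployed configuration may be worth trying.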
How to do faceting on data indexed by Nutch
Hi All, I might be repeating a question already asked by someone else, but googling didn't turn up any such mail responses. I'm fairly familiar with Solr/Lucene and its basic architecture. I've done hit highlighting in Lucene and have an idea of Solr's faceting support, but have never actually tried it. I want to implement faceting on data indexed by Nutch. I already have some MBs of data indexed by Nutch, and I just want to implement faceting on that. Can someone give me pointers on how to proceed? Or do I have to query through the Solr interface and redirect all queries to the index already created by Nutch? What is the best and simplest way to achieve this? Please help. Thanks, KK.
Re: How to do faceting on data indexed by Nutch
On 2010-04-25 15:03, KK wrote:
> Hi All, I might be repeating a question already asked by someone else, but googling didn't turn up any such mail responses. I'm fairly familiar with Solr/Lucene and its basic architecture. I've done hit highlighting in Lucene and have an idea of Solr's faceting support, but have never actually tried it. I want to implement faceting on data indexed by Nutch. I already have some MBs of data indexed by Nutch, and I just want to implement faceting on that. Can someone give me pointers on how to proceed? Or do I have to query through the Solr interface and redirect all queries to the index already created by Nutch? What is the best and simplest way to achieve this? Please help.

Nutch has two indexing/searching backends. The one that is configured by default uses plain Lucene, and it does not support faceting. The other backend uses Solr, and it supports faceting along with all other Solr features. So in your case you need to switch to Solr indexing (and searching).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
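To illustrate what faceting through the Solr backend could look like once the Nutch data has been indexed into Solr, here is a minimal sketch that only builds a facet request URL using Solr's standard facet parameters (facet, facet.field, facet.mincount). The base URL http://localhost:8983/solr and the facet field "site" are assumptions; substitute whatever your Solr instance and schema actually use. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: builds a Solr select URL that asks for facet counts.
class FacetQuerySketch {

    // URL-encodes a single query parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    // solrBase:   assumed base URL of the Solr instance, without a trailing slash
    // query:      the Solr query string, e.g. "*:*" to match all documents
    // facetField: an indexed field to count values for (assumed to exist in the schema)
    static String buildFacetUrl(String solrBase, String query, String facetField) {
        return solrBase + "/select"
                + "?q=" + enc(query)
                + "&rows=0"               // return only facet counts, no documents
                + "&facet=true"           // turn faceting on
                + "&facet.field=" + enc(facetField)
                + "&facet.mincount=1";    // omit values with zero matching documents
    }

    public static void main(String[] args) {
        System.out.println(buildFacetUrl("http://localhost:8983/solr", "*:*", "site"));
    }
}
```

Fetching the printed URL in a browser or with any HTTP client should return the per-value counts in the facet section of the Solr response, provided the field is indexed.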
Separate Nutch(crawl) and Lucene (index/search)
I have a requirement to index and search file system contents (my local server contents) and, with the same search query, search a select set of crawled web sites. I have already implemented search for my local file system with Lucene. I would like Nutch to just crawl the web sites and produce content, so that my Lucene search application can index and search the web content as well. I want to use standalone Lucene to index and search the web content too, because I want the same analyzer across both and more control over the search results, for example applying different boosts to local content versus web content. In short, I want to use the Nutch code for crawling and for retrieving the web links of search results, but do the indexing, searching, and analysis with Lucene itself. Is there a solution where only the crawling part of Nutch is taken and integrated with Lucene?
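The "different boosts for local content versus web content" idea can be sketched independently of any Lucene or Nutch API. The example below is a plain-Java illustration under stated assumptions: the Hit class, the merge method, and the boost values are all hypothetical, and it assumes the scores of the two result lists are comparable (for instance because both come from one index built with the same analyzer). It multiplies each score by a per-source boost and sorts the combined list by the boosted score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of source-dependent boosting when combining results.
class BoostedMergeSketch {

    // A single search hit: where it came from, where it points, and its score.
    static final class Hit {
        final String source;
        final String location;
        final double score;

        Hit(String source, String location, double score) {
            this.source = source;
            this.location = location;
            this.score = score;
        }
    }

    // Applies localBoost to every local hit and webBoost to every web hit,
    // then returns all hits ordered by boosted score, highest first.
    static List<Hit> merge(List<Hit> local, List<Hit> web, double localBoost, double webBoost) {
        List<Hit> merged = new ArrayList<>();
        for (Hit h : local) {
            merged.add(new Hit(h.source, h.location, h.score * localBoost));
        }
        for (Hit h : web) {
            merged.add(new Hit(h.source, h.location, h.score * webBoost));
        }
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> local = new ArrayList<>();
        local.add(new Hit("local", "/srv/docs/report.txt", 1.0));
        List<Hit> web = new ArrayList<>();
        web.add(new Hit("web", "http://example.com/page", 1.5));

        // Boost values are arbitrary illustration values.
        for (Hit h : merge(local, web, 2.0, 1.0)) {
            System.out.println(h.source + " " + h.location + " " + h.score);
        }
    }
}
```

In a real setup the same effect is usually achieved inside the search engine itself, at index time or query time, rather than by post-processing result lists; this sketch only shows the ranking consequence of such boosts.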
[VOTE] Apache Nutch 1.1 Release Candidate #2
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/

The major difference between this candidate and rc #1 is the application of NUTCH-812 (Crawl.java incorrectly uses the Generator API, resulting in an NPE), as well as some commits by Sami Siren to fix missing ASL license headers. For more detailed information on the release contents and latest changes, see the included CHANGES.txt file.

The release was made using the Nutch release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

Note: There was a request by Sami Siren that the tutorial be updated to reflect the fact that this release is a source-only release, as well as a request to integrate RAT into the build. However, in the interest of getting this 1.1 out and getting going on the Nutch TLP, my proposal is:

* update the docs independently of this release (the tutorial as it exists right now says 0.7 on it anyway and doesn't look like it's been updated in a while, so I think users can live with what's there and with support on u...@nutch.apache.org or d...@nutch.apache.org until it's updated)
* begin source-only releases in general, since we've long had the debate about the size of the Nutch release. Most folks who use Nutch are likely familiar with running ant, IMHO.
* run RAT and integrate it into the build

Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours.

Since Nutch is now a TLP and has its own PMC, there is a question of whose votes are binding release votes in this particular thread. My gut reaction is that since I started this release while we were under the Lucene PMC, for continuity purposes only votes from the Lucene PMC are binding, but everyone (especially newly minted Nutch PMC members!) is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.
[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++