Web Service on Nutch

2010-04-25 Thread Kim Theng Chong
Hi All,

I would like to build a web service based on Nutch. I followed the wiki
tutorial on Run Nutch in Eclipse 1.0 and the plugin tutorial, so I have made some
customizations to the Nutch crawl and Nutch search, and I would like to expose
these as a web service.

However, when I tried to build it (with EJB 3.0), I got error messages such as
java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not
found. Is there a tutorial that can guide me through building a web
service for Nutch, or could anyone help me with this?

Thanks.

Warm regards,
Kim


  

How to do faceting on data indexed by Nutch

2010-04-25 Thread KK
Hi All,
I might be repeating a question already asked by someone else, but googling
didn't turn up any such mail responses.
I'm fairly familiar with Solr/Lucene and their basic architecture. I've done
hit highlighting in Lucene and have an idea of Solr's faceting support, but
have never actually tried it. I want to implement faceting on data indexed by
Nutch; I already have some MBs of data indexed. Can someone give me pointers
on how to proceed? Or do I have to query through the Solr interface and
direct all the queries to the index already created by Nutch? What is the
best and simplest way to achieve this?
Please help in this regard.


Thanks,
KK.


Re: How to do faceting on data indexed by Nutch

2010-04-25 Thread Andrzej Bialecki
On 2010-04-25 15:03, KK wrote:
> Hi All,
> I might be repeating a question already asked by someone else, but googling
> didn't turn up any such mail responses.
> I'm fairly familiar with Solr/Lucene and their basic architecture. I've done
> hit highlighting in Lucene and have an idea of Solr's faceting support, but
> have never actually tried it. I want to implement faceting on data indexed by
> Nutch; I already have some MBs of data indexed. Can someone give me pointers
> on how to proceed? Or do I have to query through the Solr interface and
> direct all the queries to the index already created by Nutch? What is the
> best and simplest way to achieve this?
> Please help in this regard.

Nutch has two indexing/searching backends - the one that is configured
by default uses plain Lucene, and it does not support faceting. The
other backend uses Solr, and then of course it supports faceting and all
other Solr features.

So in your case you need to switch to use Solr indexing (and searching).
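
Once the data is in Solr, faceting is requested through ordinary query
parameters. The following is only a minimal sketch of what such a request
could look like, not a Nutch or Solr client. The base URL
http://localhost:8983/solr/select, the field name "site", and the helper
names build_facet_url and pair_facet_counts are all assumptions made for
illustration; adjust them to your own Solr setup and schema.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_fields, rows=10):
    """Return a Solr select URL that asks for facet counts on the given fields.

    base_url and the field names are assumptions; use the ones from your setup.
    """
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),      # ask for a JSON response
        ("facet", "true"),   # turn faceting on
    ]
    # facet.field may be repeated, once per field to facet on
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


def pair_facet_counts(flat):
    """Turn a flat list [value, count, value, count, ...] into (value, count) pairs.

    Solr's JSON response uses this flat layout for facet fields by default.
    """
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical example: facet search results for "nutch" by the "site" field.
url = build_facet_url("http://localhost:8983/solr/select", "nutch", ["site"])
```

The facet counts come back in the response under facet_counts, and
pair_facet_counts can be applied to the list returned for each field.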

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Separate Nutch(crawl) and Lucene (index/search)

2010-04-25 Thread sb101h

I have a requirement where I want to index and search file system contents
(my local server contents) and, with the same search query, also search a
select set of crawled web sites.

My local file system search is already implemented with Lucene. I would
like Nutch to just crawl the web sites and produce content, so that my
Lucene search application can index and search the web content as well. I
would like to use standalone Lucene for indexing and searching the web
content too, because I want to use the same analyzer for both and have more
control over the search results, for example applying different boosts to
local content vs web content. I want to use Nutch code for crawling and for
retrieving the web links of search results, but I want to do the indexing,
searching and analysis with Lucene itself.

Is there a solution where only the crawling part of Nutch is taken and is
integrated with Lucene?


[VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-25 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/

The major difference between this release and rc #1 is the application of
NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
as well as some commits by Sami Siren to fix missing ASL license headers.

For more detailed information on the release contents and latest changes,
see the included CHANGES.txt file. The release was made using the Nutch
release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

<note>
There was a request by Sami Siren that the tutorial be updated to reflect
the fact that this release is a source-only release, as well as a request to
integrate RAT into the build. However, in the interest of getting this 1.1
out and getting going on the Nutch TLP, my proposal is:

* update the docs independent of this release (the tutorial as it exists
right now says 0.7 on it anyway and doesn't look like it's been updated in
a while, so I think users can live with what's there and support on
u...@nutch.apache.org or d...@nutch.apache.org until it's updated)

* begin source-only releases in general, since we've long had the debate as
to the size of the Nutch release. Most folks that use Nutch are likely
familiar with running ant IMHO.

* run RAT and integrate it into the build

</note>

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Since Nutch is now a TLP and has its own PMC, there is a question of whose
release votes are binding in this particular thread. My gut reaction is that
since I started this release while we were under the Lucene PMC, for
continuity purposes, only votes from the Lucene PMC are binding, but everyone
(especially newly minted Nutch PMC members!) is welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++