This is all great news. It sounds like Nutch 2.0 is more of what I am looking for.
Any idea on the timeline for the first Nutch 2.0 release? I would love to get involved. Are the 19 unresolved issues for the 2.0 release in JIRA the only things that need to be tackled? I can move this conversation over to the DEV mailing list if you would like.

Thanks!
Jeremy

On Thu, Feb 24, 2011 at 5:40 PM, Markus Jelsma <[email protected]> wrote:
>> Thanks for the reply Mark.
>>
>> So this means Nutch is really only going to be used for crawling now?
>> Are there any plans for a JSON/XML RPC interface to using Nutch like
>> Solr supports?
>
> Yes, Nutch is going to focus to the fetch and parse jobs. Andrzej was working
> on a REST interface to control these jobs. This is part of 2.0.
>
>>
>> I am interested in a tight app integration where I can easily start
>> crawls of new sites, and add/remove things from the index quickly. I
>> guess I can rely directly on Solr for adding/removing from the index
>> as well, or would you recommend this going through nutch?
>
> Removing items from the index can be forced from Solr and Nutch. Solr provides
> easy methods to remove documents or documents that are the result of some
> query. Nutch can deduplicate (1.2+ and 2.0) and possibly remove 404 pages (1.3
> and 2.0) but the latter is not committed.
>
>>
>>
>> Thanks,
>> Jeremy
>>
>> On Thu, Feb 24, 2011 at 12:23 PM, Markus Jelsma <[email protected]> wrote:
>> > Hi Jeremy,
>> >
>> > Nutch' own search server is in the process of being deprecated, Nutch 1.2
>> > was the last release to provide the search server. Please consider using
>> > Apache Solr as your search server.
>> >
>> > Cheers,
>> >
>> >> I recently installed Nutch and have spent some time trying to get it
>> >> working with limited success.
>> >>
>> >> ./nutch crawl urls -dir crawl -depth 5 -topN 50
>> >>
>> >> After the crawl completes I am trying to run the web frontend with the
>> >> following command:
>> >>
>> >> ./nutch server 8080 crawl
>> >>
>> >> The server seems to be running (no output on the command line), but
>> >> when I hit localhost:8080 I get a Error 324 (net::ERR_EMPTY_RESPONSE):
>> >> Unknown error. Any ideas on how to get past this?
>> >>
>> >> I've been using this tutorial to get started.
>> >> http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
>> >>
>> >>
>> >> Thanks,
>> >> Jeremy
>
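
The Solr-side removal mentioned in the quoted reply (deleting a single document, or every document matching a query) can be sketched roughly as follows. This is only a minimal sketch, not anything from the Nutch or Solr code base: it builds the XML delete messages that Solr's XML update handler accepts and does not send them anywhere. The helper names, the example document id, and the `host` field are assumptions made for illustration; the real id and field names depend on your Solr schema.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id: str) -> str:
    """Build a Solr XML delete message for one document id (hypothetical helper)."""
    root = ET.Element("delete")
    ET.SubElement(root, "id").text = doc_id
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


def delete_by_query(query: str) -> str:
    """Build a Solr XML delete message for all documents matching a query (hypothetical helper)."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed values: the id and the "host" field are examples only.
    print(delete_by_id("http://example.com/page.html"))
    print(delete_by_query("host:example.com"))
```

Each message would then be POSTed to the Solr update handler and followed by a commit; the exact URL depends on how Solr is deployed, so it is left out here.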

