Hello,

I wondered if there is a way of adding one Solr index built from Nutch segments to another Solr index also built from Nutch segments. I have to index about 3000 domains, but 5 of them are newspaper sites. So I need to crawl, fetch and parse these 5 domains (with depth 2) and update the index every day or so. The rest are crawled and indexed once a month.

Thanks.
Alex.

-----Original Message-----
From: Markus Jelsma <[email protected]>
To: Jeremy Arnold <[email protected]>
Cc: user <[email protected]>
Sent: Thu, Feb 24, 2011 3:46 pm
Subject: Re: Starting web frontend

> Thanks for the reply Mark.
>
> So this means Nutch is really only going to be used for crawling now?
> Are there any plans for a JSON/XML RPC interface to using Nutch like
> Solr supports?

Yes, Nutch is going to focus on the fetch and parse jobs. Andrzej was working on a REST interface to control these jobs. This is part of 2.0.

>
> I am interested in a tight app integration where I can easily start
> crawls of new sites, and add/remove things from the index quickly. I
> guess I can rely directly on Solr for adding/removing from the index
> as well, or would you recommend this going through nutch?

Removing items from the index can be forced from both Solr and Nutch. Solr provides easy methods to remove individual documents or all documents that match a query. Nutch can deduplicate (1.2+ and 2.0) and possibly remove 404 pages (1.3 and 2.0), but the latter is not committed.

>
>
> Thanks,
> Jeremy
>
> On Thu, Feb 24, 2011 at 12:23 PM, Markus Jelsma <[email protected]> wrote:
> > Hi Jeremy,
> >
> > Nutch's own search server is in the process of being deprecated; Nutch 1.2
> > was the last release to provide the search server. Please consider using
> > Apache Solr as your search server.
> >
> > Cheers,
> >
> >> I recently installed Nutch and have spent some time trying to get it
> >> working with limited success.
> >>
> >> ./nutch crawl urls -dir crawl -depth 5 -topN 50
> >>
> >> After the crawl completes I am trying to run the web frontend with the
> >> following command:
> >>
> >> ./nutch server 8080 crawl
> >>
> >> The server seems to be running (no output on the command line), but
> >> when I hit localhost:8080 I get an Error 324 (net::ERR_EMPTY_RESPONSE):
> >> Unknown error. Any ideas on how to get past this?
> >>
> >> I've been using this tutorial to get started.
> >> http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
> >>
> >>
> >> Thanks,
> >> Jeremy
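
For reference, the Solr-side removal Markus mentions (deleting a single document, or everything matching a query) might look roughly like the sketch below. It builds Solr's XML update messages and posts them to the update handler. The URL, the `site` field name, and the helper names are assumptions for illustration only and would need to match the actual Solr setup and schema.

```python
from xml.sax.saxutils import escape
import urllib.request


def delete_by_id(doc_id: str) -> str:
    """Build a Solr XML update message that deletes one document by unique key."""
    return f"<delete><id>{escape(doc_id)}</id></delete>"


def delete_by_query(query: str) -> str:
    """Build a Solr XML update message that deletes all documents matching a query."""
    return f"<delete><query>{escape(query)}</query></delete>"


def post_update(solr_update_url: str, xml_body: str) -> int:
    """POST an XML update message to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        solr_update_url,
        data=xml_body.encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical usage (assumed URL and field name, adjust to your install):
#   url = "http://localhost:8983/solr/update"
#   post_update(url, delete_by_query("site:example.com"))
#   post_update(url, "<commit/>")  # deletions become visible after a commit
```

Keeping the message builders separate from the HTTP call makes it easy to inspect the XML before anything is sent to a live index.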

