Re: Starting web frontend

alxsss Thu, 24 Feb 2011 16:43:01 -0800

 

Hello,


I wondered whether there is a way to add one Solr index built from Nutch
segments to another Solr index that was also built from Nutch segments.
I have to index about 3000 domains, but 5 of them are newspaper sites. So I
need to crawl, fetch, and parse these 5 domains (with depth 2) and update the
index every day or so. The rest are crawled and indexed once a month.

Thanks.
Alex.


 

 

-----Original Message-----
From: Markus Jelsma <[email protected]>
To: Jeremy Arnold <[email protected]>
Cc: user <[email protected]>
Sent: Thu, Feb 24, 2011 3:46 pm
Subject: Re: Starting web frontend


> Thanks for the reply Mark.
> 
> So this means Nutch is really only going to be used for crawling now?
> Are there any plans for a JSON/XML RPC interface to using Nutch like
> Solr supports?



Yes, Nutch is going to focus on the fetch and parse jobs. Andrzej was working
on a REST interface to control these jobs. This is part of 2.0.



> 
> I am interested in a tight app integration where I can easily start
> crawls of new sites, and add/remove things from the index quickly. I
> guess I can rely directly on Solr for adding/removing from the index
> as well, or would you recommend this going through nutch?



Removing items from the index can be done from both Solr and Nutch. Solr
provides easy methods to remove individual documents or all documents that
match some query. Nutch can deduplicate (1.2+ and 2.0) and possibly remove 404
pages (1.3 and 2.0), but the latter is not committed yet.
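
As a rough illustration of the Solr side, here is a minimal Python sketch that
builds Solr's XML update messages for deleting by id and deleting by query. The
update URL, the helper names, and the `host` field used in the usage note are
assumptions for illustration; check them against your own Solr setup and schema.

```python
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape

# Assumed Solr update endpoint; adjust host, port, and path for your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def delete_by_id_xml(doc_id):
    """Build the XML message that deletes a single document by its unique key."""
    return "<delete><id>%s</id></delete>" % escape(doc_id)


def delete_by_query_xml(query):
    """Build the XML message that deletes every document matching a query."""
    return "<delete><query>%s</query></delete>" % escape(query)


def post_update(xml_body, url=SOLR_UPDATE_URL):
    """POST an XML update message to Solr and return the HTTP status code."""
    request = Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urlopen(request) as response:
        return response.status
```

With a running Solr, something like
`post_update(delete_by_query_xml("host:news.example.com"))` followed by
`post_update("<commit/>")` would remove the matching documents, assuming the
index has a `host` field. The deletes only become visible after the commit.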



> 
> Thanks,
> Jeremy
> 

> On Thu, Feb 24, 2011 at 12:23 PM, Markus Jelsma <[email protected]> wrote:
> > Hi Jeremy,
> > 
> > Nutch's own search server is in the process of being deprecated; Nutch 1.2
> > was the last release to provide the search server. Please consider using
> > Apache Solr as your search server.
> > 
> > Cheers,
> > 

> >> I recently installed Nutch and have spent some time trying to get it
> >> working with limited success.
> >> 
> >> ./nutch crawl urls -dir crawl -depth 5 -topN 50
> >> 
> >> After the crawl completes I am trying to run the web frontend with the
> >> following command:
> >> 
> >> ./nutch server 8080 crawl
> >> 
> >> The server seems to be running (no output on the command line), but
> >> when I hit localhost:8080 I get an Error 324 (net::ERR_EMPTY_RESPONSE):
> >> Unknown error. Any ideas on how to get past this?
> >> 
> >> I've been using this tutorial to get started.
> >> http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
> >> 
> >> Thanks,
> >> Jeremy
