[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

2006-06-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

--
  
  The search web application is included in your downloaded Nutch archive. In 
order for the nutch search web application to function properly, it needs to 
know where to find the indexes. We need to map our indexes by editing the 
‘nutch-site.xml’ file.
  
- NOTE: the steps below assume that the 
  
  The steps to follow would be;
  


[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

2006-06-14 Thread Apache Wiki

The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

--

  === 3.2.2 Edit the file conf/crawl-urlfilter.txt ===
  
- and replace the existing domain name with the name of the domain you wish to 
crawl. For example, if you wished to limit the crawl to the openreach.co.uk 
domain, the line should read:
+ and replace the existing domain name with the name of the domain you wish to 
crawl. For example, if you wished to limit the crawl to the virtusa.com domain, 
the line should read:
  
{{{ +^http://([a-z0-9]*\.)*virtusa.com/ }}}
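The filter line above can be sanity-checked from a shell before starting a crawl. This is only a sketch: it assumes the leading `+` marks an include rule and is not part of the expression, and it uses `grep -E` (POSIX extended regular expressions) rather than the regex engine Nutch itself uses, which should agree for a pattern this simple. The function name `matches_filter` and the sample URLs are made up for illustration.

```shell
# Pattern taken from the filter line, without the leading '+'.
pattern='^http://([a-z0-9]*\.)*virtusa.com/'

matches_filter() {
  # Succeeds when the URL given as $1 matches the pattern.
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Sample URLs chosen for illustration only.
for url in http://www.virtusa.com/ http://virtusa.com/about http://www.example.com/; do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the dot in `virtusa.com` is not escaped in the original line, so it matches any character, not only a literal dot.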
  
@@ -154, +154 @@

  
  == 3.3 Configuring the Nutch Web Application ==
  
- The search web application is already integrated and deployed along with the 
ORPG application. In order for the nutch search web application to function 
properly, it needs to know where to find the indexes. We need to map our 
indexes by editing the ‘nutch-site.xml’ file.
+ The search web application is included in your downloaded Nutch archive. In 
order for the nutch search web application to function properly, it needs to 
know where to find the indexes. We need to map our indexes by editing the 
‘nutch-site.xml’ file.
  
  NOTE: the steps below assume that the 
  


[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

2006-03-24 Thread Apache Wiki

The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

--
  1. Through the nutch Shell Script for administrative tasks, such as creating 
and maintaining indexes
  [[BR]]2. Through the Search Web Application, in order to perform a search 
using keywords
  
+ The ''sequence diagram'' below shows how each of these components interact in 
implementing a Nutch based search application.
+ 
+ [[BR]]http://static.flickr.com/43/117204451_d634c1d869.jpg
  
  = 3 Implementing a Nutch Search =
  
@@ -178, +181 @@

  
  Now that we have created the indexes and configured the Nutch web 
application, the only thing left is to give it a test run.
  
- Open a browser and type your Tomcat URL (ex: http://localhost:8080)
+ Open a browser and type your Tomcat URL (ex: http://localhost:8080). The 
following page will greet you if the web application is configured properly.
  
+ [[BR]]http://static.flickr.com/42/117204449_d3fe6f8400.jpg
+ 
- Now type a keyword (ex: virtusa) and click search. If the implementation 
works as expected, the following results page will display.
+ [[BR]]Now type a keyword (ex: virtusa) and click search. If the 
implementation works as expected, the following results page will be displayed.
+ 
+ [[BR]]http://static.flickr.com/56/117204450_8317279e3a.jpg
  
  == 3.5 Maintaining Our Index ==
  


[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

2006-03-23 Thread Apache Wiki

The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

--
  The Nutch search engine consists, very roughly, of three components:
  
  1. The Crawler, which discovers and retrieves web pages
- 
- 2. The ‘WebDB’, a custom database that stores known URLs and fetched page 
contents
+ [[BR]]2. The ‘WebDB’, a custom database that stores known URLs and 
fetched page contents
- 
- 3. The ‘Indexer’, which dissects pages and builds keyword-based indexes 
from them
+ [[BR]]3. The ‘Indexer’, which dissects pages and builds keyword-based 
indexes from them
  
- After the initial creation of an Index, it is usual to perform periodic 
updates of the index, in order to keep it up-to-date. We will look into the 
details of index maintenance in the parts following this.
+ (!) After the initial creation of an Index, it is usual to perform periodic 
updates of the index, in order to keep it up-to-date. We will look into the 
details of index maintenance in the parts following this.
  
  == 2.2 The Nutch Web Application ==
  
@@ -61, +59 @@

  All components listed above use the nutch API. The users can utilize the API 
via two approaches, which depends on the task at hand.
  
  1. Through the nutch Shell Script for administrative tasks, such as creating 
and maintaining indexes
- 2. Through the Search Web Application, in order to perform a search using 
keywords
+ [[BR]]2. Through the Search Web Application, in order to perform a search 
using keywords
  
  
  = 3 Implementing a Nutch Search =
@@ -69, +67 @@

  Implementing our own version of Nutch is fairly easy, provided that you;
  
  1. have a basic understanding of how a web search engine works and 
- 2. are comfortable working in a command line and finally
+ [[BR]]2. are comfortable working in a command line and finally
- 3. have a fair knowledge of Java and Servlet containers
+ [[BR]]3. have a fair knowledge of Java and Servlet containers
  
  If you said ‘yes’ to all three questions above, you have a very high 
probability of having your Nutch implementation up and running by the end of 
the steps which follows. 
  
@@ -81, +79 @@

  
  Go to http://www.apache.org/dyn/closer.cgi/lucene/nutch/ and select a mirror 
to download Nutch. The version described in this document is version 0.7. After 
downloading the archive, extract it to your disk. 
  
- NOTE: This document assumes that the archive was extracted to 
/home/tyrell/nutch-0.7 change this path to reflect your location.
+ /!\ NOTE: This document assumes that the archive was extracted to 
/home/tyrell/nutch-0.7 change this path to reflect your location.
  
  === 3.1.2 Download and Install a Servlet Container ===
  
@@ -125, +123 @@

   * -threads threads determines the number of threads that will fetch in 
parallel.
   }}}
  
-   For example, a typical command might be:
+ For example, a typical command might be:
  
{{{ bin/nutch crawl urls -dir crawl.virtusa -depth 10 }}}
  
@@ -225, +223 @@

}}}
  
  
- === 3.5.2 Scheduling Index Updation ===
+ === 3.5.2 Scheduling Index Updations ===
  
  The above shell script can be scheduled to be run periodically using a 
‘cron’ job. 
  


[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

2006-03-23 Thread Apache Wiki

The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

--
  The Nutch search engine consists, very roughly, of three components:
  
  1. The Crawler, which discovers and retrieves web pages
+ 
  2. The ‘WebDB’, a custom database that stores known URLs and fetched page 
contents
+ 
  3. The ‘Indexer’, which dissects pages and builds keyword-based indexes 
from them
  
  After the initial creation of an Index, it is usual to perform periodic 
updates of the index, in order to keep it up-to-date. We will look into the 
details of index maintenance in the parts following this.


[Nutch Wiki] Update of "Nutch - The Java Search Engine" by tyrellperera

2006-03-23 Thread Apache Wiki

The following page has been changed by tyrellperera:
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

New page:
By Tyrell Perera ([EMAIL PROTECTED]), Virtusa Corp. (http://www.virtusa.com)

= 1 Introduction =

== 1.1 What is Nutch? ==

Nutch is an effort to build a Free and Open Source search engine. It uses 
Lucene for the search and index component. The fetcher (robot) has been written 
from scratch solely for this project.

Nutch has a highly modular architecture allowing developers to create plug-ins 
for activities such as media-type parsing, data retrieval, querying and 
clustering.

''Doug Cutting'' is the lead developer of Nutch.


== 1.2 What is Lucene? ==

Lucene is a Free and Open Source search and index API released by the Apache 
Software Foundation. It is written in Java and is released under the Apache 
Software License.

Lucene is just the core of a search engine. As such, it does not include things 
like a web spider or parsers for different document formats. Instead these 
things need to be added by a developer who uses Lucene.

Lucene does not care about the source of the data, its format, or even its 
language, as long as you can convert it to text. This means you can use Lucene 
to index and search data stored in files: web pages on remote web servers, 
documents stored in local file systems, simple text files, Microsoft Word 
documents, HTML or PDF files, or any other format from which you can extract 
textual information.

Lucene has been ported or is in the process of being ported to various 
programming languages other than Java:

  * {{{ Lucene4c - C }}}
  * {{{ CLucene - C++ }}}
  * {{{ MUTIS - Delphi }}}
  * {{{ NLucene - .NET }}}
  * {{{ DotLucene - .NET }}}
  * {{{ Plucene - Perl }}}
  * {{{ PyLucene - Python }}}
  * {{{ Ferret and RubyLucene – Ruby }}}


== 1.3 What License? ==

Both Nutch and Lucene are Apache projects and carry the Apache license 
(http://www.opensource.org/licenses/apache2.0.php).


= 2 The Design of Nutch =

== 2.1 Core Components of Nutch ==

The Nutch search engine consists, very roughly, of three components:

1. The Crawler, which discovers and retrieves web pages
2. The ‘WebDB’, a custom database that stores known URLs and fetched page 
contents
3. The ‘Indexer’, which dissects pages and builds keyword-based indexes 
from them

After the initial creation of an index, it is usual to update it periodically 
in order to keep it up to date. We will look into the details of index 
maintenance in the sections that follow.

== 2.2 The Nutch Web Application ==

Apart from the three components above, Nutch includes a Search Web Application. 
This is a JSP application that can be configured and deployed in a servlet 
container.

== 2.3 The Nutch API ==

All components listed above use the Nutch API. Users can access the API in two 
ways, depending on the task at hand:

1. Through the nutch Shell Script for administrative tasks, such as creating 
and maintaining indexes
2. Through the Search Web Application, in order to perform a search using 
keywords


= 3 Implementing a Nutch Search =

Implementing our own version of Nutch is fairly easy, provided that you:

1. have a basic understanding of how a web search engine works and 
2. are comfortable working in a command line and finally
3. have a fair knowledge of Java and Servlet containers

If all three conditions above apply to you, you have a very good chance of 
having your Nutch implementation up and running by the end of the steps that 
follow.


== 3.1 Before We Begin ==

=== 3.1.1 Download Nutch ===

Go to http://www.apache.org/dyn/closer.cgi/lucene/nutch/ and select a mirror to 
download Nutch. The version described in this document is version 0.7. After 
downloading the archive, extract it to your disk. 

NOTE: This document assumes that the archive was extracted to 
/home/tyrell/nutch-0.7; change this path to reflect your location.
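
A quick way to confirm that the extraction path is right is to check for the `bin/nutch` script inside it. The following is a minimal sketch; the variable name `NUTCH_HOME` and the helper `check_nutch_home` are hypothetical names used only here, not something Nutch itself defines.

```shell
# NUTCH_HOME is a hypothetical variable for this sketch; it defaults to the
# path assumed in this document.
NUTCH_HOME="${NUTCH_HOME:-/home/tyrell/nutch-0.7}"

check_nutch_home() {
  # Succeeds when the directory given as $1 contains the bin/nutch script.
  [ -f "$1/bin/nutch" ]
}

if check_nutch_home "$NUTCH_HOME"; then
  echo "Nutch found in $NUTCH_HOME"
else
  echo "bin/nutch not found under $NUTCH_HOME; adjust the path"
fi
```

If the second message is printed, set `NUTCH_HOME` to the directory where the archive was actually extracted and run the check again.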

=== 3.1.2 Download and Install a Servlet Container ===

Apache Tomcat is a popular Open Source servlet container. We will use this to 
deploy the Nutch search web application in this document. The version referred 
to in this document is version 5.5. You can download Tomcat from 
http://tomcat.apache.org/download-55.cgi 

=== 3.1.3 To Cygwin or not to Cygwin ===

To run the Nutch shell scripts and create the indexes, we require a UNIX-like 
environment. If you do not have access to a UNIX environment, you can use 
Cygwin as an alternative. More details and download information for Cygwin can 
be found at http://www.cygwin.com/ 
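
If you are unsure which kind of environment a given terminal provides, the kernel name reported by `uname -s` is a reasonable hint. The sketch below is illustrative only: the function name `describe_env` is made up, and the list of kernel names it recognizes is an assumption that is far from exhaustive.

```shell
describe_env() {
  # Classify a kernel name as reported by `uname -s`.
  case "$1" in
    CYGWIN*) echo "Cygwin" ;;
    Linux|Darwin|*BSD|SunOS) echo "UNIX-like" ;;
    *) echo "unknown" ;;
  esac
}

describe_env "$(uname -s)"
```

Either of the first two results suggests the Nutch shell scripts can be run from that terminal.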


== 3.2 Creating the Index ==
The Nutch web application requires at least one search index in order to 
function. A search index in Nutch is represented in the file system as 
a directory. However, it is much more than that