[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
UAC issue on Vista

--
  
  == Tested with ==
   * Nutch release 1.0
-  * Eclipse 3.3 - aka Europa, ganymede
+  * Eclipse 3.3 (Europa) and 3.4 (Ganymede)
   * Java 1.6
   * Ubuntu (should work on most platforms though)
-  * Windows XP
+  * Windows XP and Vista
  
  == Before you start ==
  
@@ -21, +21 @@

  
  === For Windows Users ===
  
- If you are running Windows (tested on Windows XP) you must first install 
cygwin
+ If you are running Windows (tested on Windows XP) you must first install 
cygwin. Download it from http://www.cygwin.com/setup.exe
  
+ Install cygwin and set the PATH environment variable for it. You can set it 
from the Control Panel, System, Advanced Tab, Environment Variables and 
edit/add PATH.
- Download cygwin from http://www.cygwin.com/setup.exe
- 
- Install cygwin and set PATH variable for it.
- 
- It's in control panel, system, advanced tab, environment variables and 
edit/add PATH
  
  I have in PATH like:
- 
+ {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
+ }}}
+ If you run bash from the Windows command line (Start > Run... > cmd.exe) it should successfully run cygwin.
  
- If you run bash in Start > Run... > cmd.exe it should work. 
+ If you are running Eclipse on Vista, you will likely need to [http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/ turn off Vista's User Account Control (UAC)]. Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
+ {{{
+ org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions 
of ... Permission denied
+ }}}
  
+ See 
[http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell%24ExitCodeException%3A%20chmod%3A%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results
 this] for more information about the UAC issue.
  
  === Install Nutch ===
   * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] 
of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ 
official 1.0 release]. 


[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski

2009-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
Added java heap size solution

--
- = RunNutchInEclipse =
+ = Run Nutch In Eclipse on Linux and Windows nutch version 0.9=
  
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  
@@ -104, +104 @@

   * click on Run
   * if all works, you should see Nutch getting busy at crawling :-)
  
- == Debug Nutch in Eclipse (not yet tested for 0.9) ==
+ == Java Heap Size problem ==
+ 
+ If you find in hadoop.log a line similar to this:
+ 
+ {{{
+ 2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
+ java.lang.OutOfMemoryError: Java heap space
+ }}}
+ 
+ You should increase the amount of memory available to applications run from Eclipse.
+ 
+ Just set it in:
+ 
+ Eclipse > Window > Preferences > Java > Installed JREs > edit > Default VM arguments
+ 
+ I've set mine to 
+ {{{
+ -Xms5m -Xmx150m 
+ }}}
+ because I have about 200MB of RAM left after running all my apps.
+ 
+ -Xms (minimum amount of RAM for running applications)
+ -Xmx (maximum)
+ 
+ 
+ == Debug Nutch in Eclipse ==
   * Set breakpoints and debug a crawl
   * It can be tricky to find out where to set the breakpoint, because of the 
Hadoop jobs. Here are a few good places to set breakpoints:
  {{{


[Nutch Wiki] Update of RunNutchInEclipse1.0 by BartoszGadzimski

2009-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Copied page for 1.0 release

New page:
= Run Nutch In Eclipse on Linux and Windows nutch version 1.0=

This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)

== Tested with ==
 * Nutch release 1.0
 * Eclipse 3.3 - aka Europa, ganymede
 * Java 1.6
 * Ubuntu (should work on most platforms though)
 * Windows XP

== Before you start ==

Setting up Nutch to run in Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents).
However, it's very useful to be able to debug Nutch in Eclipse. But again you 
might be quicker by looking at the logs (logs/hadoop.log)...


== Steps ==


=== For Windows Users ===

If you are running Windows (tested on Windows XP) you must first install cygwin.

Download cygwin from http://www.cygwin.com/setup.exe

Install cygwin and set PATH variable for it.

It's in control panel, system, advanced tab, environment variables and edit/add 
PATH

I have in PATH like:

C:\Sun\SDK\bin;C:\cygwin\bin

If you run bash in Start-RUN-cmd.exe it should work. 
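As a quick sanity check (a hypothetical session; the exact version string depends on your Windows and cygwin versions):
{{{
C:\> bash
bash$ uname
CYGWIN_NT-5.1
}}}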


Then you should install the tools from the Microsoft website that add the 'whoami' command.

Example for Windows XP SP2:

http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en


Then you can follow the rest of these steps.

=== Install Nutch ===
 * Grab a fresh release of Nutch 0.9 - 
http://lucene.apache.org/nutch/version_control.html
 * Do not build Nutch now. Make sure you have no .project and .classpath files 
in the Nutch directory


=== Create a new java project in Eclipse ===
 * File > New > Project > Java project > click Next
 * Name the project (Nutch_Trunk for instance)
 * Select Create project from existing source and use the location where you 
downloaded Nutch
 * Click on Next, and wait while Eclipse is scanning the folders
 * Add the folder conf to the classpath (third tab and then add class folder) 
 * Go to the Order and Export tab, find the entry for the added conf folder and move it to the top. This is required to make Eclipse take the config resources (nutch-default.xml, nutch-final.xml, etc.) from our conf folder and not from anywhere else.
 * Eclipse should have guessed all the java files that must be added on your 
classpath. If it's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries 
 * Set output dir to tmp_build, create it if necessary
 * DO NOT add build to classpath


=== Configure Nutch ===
 * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
 * Change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml 
 * Make sure Nutch is configured correctly before testing it in Eclipse ;-)
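For reference, a minimal sketch of what that property looks like once changed (plugin.folders is a standard Nutch property; the description wording here is paraphrased, not copied from nutch-default.xml):
{{{
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.</description>
</property>
}}}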

=== Missing org.farng and com.etranslate ===
Eclipse will complain about some import statements in parse-mp3 and parse-rtf 
plugins (30 errors in my case).
Because of incompatibility with the Apache license, the .jar files that define 
the necessary classes were not included with the source code. 

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ 
respectively.
Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... > Then select the Libraries tab, click Add Jars... and then add each .jar file individually).


=== Build Nutch ===
If you set up the project correctly, Eclipse will build Nutch for you into tmp_build. See below for problems you could run into.



=== Create Eclipse launcher ===
 * Menu Run > Run...
 * create New for Java Application
 * set in Main class
{{{
org.apache.nutch.crawl.Crawl
}}}
 * on tab Arguments, Program Arguments
{{{
urls -dir crawl -depth 3 -topN 50
}}}
 * in VM arguments
{{{
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
}}}
 * click on Run
 * if all works, you should see Nutch getting busy at crawling :-)


== Java Heap Size problem ==

If you find in hadoop.log a line similar to this:

{{{
2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
}}}

You should increase the amount of memory available to applications run from Eclipse.

Just set it in:

Eclipse > Window > Preferences > Java > Installed JREs > edit > Default VM arguments

I've set mine to 
{{{
-Xms5m -Xmx150m 
}}}
because I have about 200MB of RAM left after running all my apps.

-Xms (minimum amount of RAM for running applications)
-Xmx (maximum)

[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by BartoszGadzimski

2009-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

--
- = Run Nutch In Eclipse on Linux and Windows nutch version 0.9=
+ = Run Nutch In Eclipse on Linux and Windows nutch version 0.9 =
  
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  


[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by BartoszGadzimski

2009-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

--
- = Run Nutch In Eclipse on Linux and Windows nutch version 1.0=
+ = Run Nutch In Eclipse on Linux and Windows nutch version 1.0 =
  
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  


[Nutch Wiki] Update of FrontPage by BartoszGadzimski

2009-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/FrontPage

--
   * UpgradeFrom07To08
   * [Upgrading_from_0.8.x_to_0.9]
   * RunNutchInEclipse for v0.8
-  * [RunNutchInEclipse0.9] for v0.9
+  * [RunNutchInEclipse0.9] for v0.9 (Linux and Windows)
+  * [RunNutchInEclipse1.0] for v1.0 (Linux and Windows)
   * [Crawl] - script to crawl (and possible recrawl too)
   * IntranetRecrawl - script to recrawl a crawl
   * MergeCrawl - script to merge 2 (or more) crawls 


[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski

2009-04-10 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

--
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  
  == Tested with ==
-  * Nutch release 0.9
+  * Nutch release 0.9 and 1.0
   * Eclipse 3.3 - aka Europa
   * Java 1.6
   * Ubuntu (should work on most platforms though)
@@ -35, +35 @@

  C:\Sun\SDK\bin;C:\cygwin\bin
  
  If you run bash in Start-RUN-cmd.exe it should work. 
+ 
+ 
+ Then you should install the tools from the Microsoft website that add the 'whoami' command.
+ 
+ Example for Windows XP SP2:
+ 
+ http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en
  
  Then you can follow the rest of these steps.
  


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2009-03-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

--
  === Important Points ===
   1. For the authscope tag, the 'host' and 'port' attributes should always be specified. The 'realm' and 'scheme' attributes may or may not be specified depending on your needs. If you are tempted to omit the 'host' and 'port' attributes because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That's what the 'default' tag is meant for.
   1. One authentication scope should not be defined twice as different authscope tags under different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag are used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for each authentication scope; if the same authentication scope is encountered again, it is overwritten with the new credentials. However, one should not rely on this behavior, as it might change with further development.
-  1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This can means 
there should not be multiple tags with same host, port, scheme=NTLM but 
different realms. If you are omitting the scheme attribute and the server 
requires NTLM authentication, then there should not be multiple tags with same 
host, port but different realms. This is discussed more in the next section.
+  1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This means there 
should not be multiple tags with same host, port, scheme=NTLM but different 
realms. If you are omitting the scheme attribute and the server requires NTLM 
authentication, then there should not be multiple tags with same host, port but 
different realms. This is discussed more in the next section.
   1. If you are using the NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml.
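  To make the points above concrete, here is a minimal sketch of a 'conf/httpclient-auth.xml' entry (the host, port, realm, and credential values are hypothetical):
{{{
<auth-configuration>
  <credentials username="someuser" password="somepass">
    <authscope host="intranet.example.com" port="80" realm="IntranetRealm"/>
  </credentials>
</auth-configuration>
}}}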
  
  === A note on NTLM domains ===
@@ -104, +104 @@

   1. Do you see Nutch trying to fetch the pages you were expecting in 'logs/hadoop.log'? You should see some logs like fetching http://www.example.com/expectedpage.html, where the URL is the page you were expecting to be fetched. If you don't see such lines for the pages you were expecting, the error is outside the scope of this feature. This feature comes into action only when the crawler is fetching a page and the page requires authentication.
   1. With debug logs enabled, check whether there are lines beginning with Credentials in 'logs/hadoop.log'. The lines would look like: Credentials - username someuser; set. For every entry in 'conf/httpclient-auth.xml' you should find a corresponding log line. If they are absent, you probably haven't included the plugin in 'plugin.includes'. In case you have manually patched the Nutch 0.9 source code with the patch, this issue may be caused by not having rebuilt the project.
   1. Do you see logs like this: auth.!AuthChallengeProcessor - basic authentication scheme selected? Instead of the word 'basic', you might see 'digest' or 'NTLM' depending on the scheme supported by the page being fetched. If you do not see it at all, probably the web server or the page being fetched does not require authentication. In that case, the crawler would not try to authenticate. If you were expecting authentication for the page, probably something needs to be fixed on the server side.
-  1. You should also see some logs that begin with: Pre-configured 
credentials with scope. It is very unlikely that this should happen after you 
have ensured all the above points. If it happens, please let us know in the 
mailing list.
  
  Once you have checked the items listed above and you are still unable to fix 
the problem or confused about any point listed above, please mail the issue 
with the following information:
  
   1. Version of Nutch you are running.
-  1. Did you get this feature directly from subversion or did you download the 
patch separately and apply?
+  1. Complete contents of the 'conf/httpclient-auth.xml' file.
   1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send 
the complete file.
  


[Nutch Wiki] Update of PublicServers by KevinReader

2009-03-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KevinReader:
http://wiki.apache.org/nutch/PublicServers

--
* [http://campusgw.library.cornell.edu/ Cornell University Library] is 
collaborating with the research group of Thorsten Joachims to develop a 
learning search engine for library web pages based on Nutch. The nutch-based 
search engine is near the bottom of the page.
  
* [http://search.creativecommons.org/ Creative Commons] is a search engine 
for creative commons licensed material.
+ 
+   * [http://www.dadi360.com/ Dadi360] uses the Nutch search engine to provide search of Chinese-language websites in North America.
  
* [http://www.ecolicommunity.org/Websearch EcoliHub Web Search] an E. coli-specific search engine based on Nutch. EcoliHub WebSearch includes only those sites relevant to E. coli, thereby reducing the number of spurious hits. Searches can optionally be limited to your choice of resources. More than 110,000 pages to search, with more resources being added.
  


[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes

2009-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

The comment on the change is:
comment pointing out multiple segment flags

--
  = Example Running new Scoring and Indexing Systems =
  

  Below is an example of running the new scoring and indexing systems from 
start to finish.  This was done with a sample of 1000 urls and I ran two 
different fetch cycles.  The first being 1000 urls and the second being the top 
2000 urls.  The loops job is optional but included for completeness.  In 
production we have actually removed that job.  This was done with a clean pull 
from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released).  If 
anybody has any problems running these commands or has questions send me an 
email or send one to the nutch users or dev list and I will reply.  Please send 
it to kubes at the apache address dot org.
+ 
  
  {{{
  bin/nutch inject crawl/crawldb crawl/urls/
@@ -10, +11 @@

  bin/nutch fetch crawl/segments/20090306093949/
  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
+ 
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb 
crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb 
crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb 
crawl/crawldb -webgraphdb crawl/webgraphdb/
@@ -55, +57 @@

  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306100055/
  rm -fr crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ 
-webgraphdb crawl/webgraphdb
+ }}}
+ 
+ One thing that has been brought up is the -segment flag on WebGraph.  If you have more than one segment, you would pass more than one -segment flag, as shown above.
+ 
+ {{{
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb 
crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb 
crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb 
crawl/crawldb -webgraphdb crawl/webgraphdb/


[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes

2009-03-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

--
  bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
  
+ }}}
+ 
+ One thing to point out here is that WebGraph is meant to be used on larger web crawls to create web graphs.  By default it ignores outlinks to pages in the same domain (including subdomains) and to pages with the same hostname.  It also limits each page to a single outlink to the same page and a single outlink to the same domain.  All of these options can be changed through the following configuration options:
+ 
+ {{{
+ 
+ <!-- linkrank scoring properties -->
+ <property>
+   <name>link.ignore.internal.host</name>
+   <value>true</value>
+   <description>Ignore outlinks to the same hostname.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.internal.domain</name>
+   <value>true</value>
+   <description>Ignore outlinks to the same domain.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.limit.page</name>
+   <value>true</value>
+   <description>Limit to only a single outlink to the same page.</description>
+ </property>
+ 
+ <property>
+   <name>link.ignore.limit.domain</name>
+   <value>true</value>
+   <description>Limit to only a single outlink to the same domain.</description>
+ </property>
+ 
+ }}}
+ 
+ But by default, if you are only crawling pages within a domain or within a set of subdomains, all outlinks will be ignored and you will end up with an empty webgraph.  This in turn will cause an error in the LinkRank job.  The flip side is that by NOT ignoring links to the same domain/host, and by not limiting those links, the webgraph becomes much, much denser, so there are a lot more links to process, most of which probably won't improve relevancy much.
+ 
+ {{{
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb 
crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb 
crawl/webgraphdb/
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb 
crawl/crawldb -webgraphdb crawl/webgraphdb/


[Nutch Wiki] Update of HardwareRequirements by NycoNyco

2009-03-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by NycoNyco:
http://wiki.apache.org/nutch/HardwareRequirements

The comment on the change is:
title

--
- 
  = Hardware Requirements =
  
  In general, fetching and database updates require lots of disk, and searching 
is faster with more RAM. But the particulars depend on how big of an index 
you're trying to build and how much query traffic you expect.
+ 
+ == Requirements for indexing ==
  
  As a general rule, each page fetched requires around 10k of disk overall (for 
the page cache, its text, the index, db entries, etc.). So a terabyte of 
storage is required for every 100M pages.
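  Spelled out, that rule of thumb is: 10 KB/page × 100,000,000 pages = 1,000,000,000 KB, i.e. roughly one terabyte.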
  


[Nutch Wiki] Update of Features by NycoNyco

2009-03-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by NycoNyco:
http://wiki.apache.org/nutch/Features

The comment on the change is:
(non-exhaustive) tentative features list (please review)

--
  (Please reformat this text and divide it into feature lists, questions, and questions & answers). 
  
  == Features ==
+ 
+  * Fetching, parsing and indexing in parallel and/or distributed
+  * Plugins
+  * Many formats: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)
+  * Ontology
+  * Clustering
+  * MapReduce
+  * Distributed filesystem (via Hadoop)
+  * Link-graph database
+  * NTLM authentication
  
  == Questions and Answers ==
  


[Nutch Wiki] Update of RunNutchInEclipse0.9 by BartoszGadzimski

2009-03-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
added description for Windows users

--
   * Eclipse 3.3 - aka Europa
   * Java 1.6
   * Ubuntu (should work on most platforms though)
+  * Windows XP
  
  == Before you start ==
  
  Setting up Nutch to run in Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents).
  However, it's very useful to be able to debug Nutch in Eclipse. But again you 
might be quicker by looking at the logs (logs/hadoop.log)...
  
+ 
  == Steps ==
+ 
+ 
+ === For Windows Users ===
+ 
+ If you are running Windows (tested on Windows XP) you must first install 
cygwin
+ 
+ Download cygwin from http://www.cygwin.com/setup.exe
+ 
+ Install cygwin and set PATH variable for it.
+ 
+ It's in control panel, system, advanced tab, environment variables and 
edit/add PATH
+ 
+ I have in PATH like:
+ 
+ C:\Sun\SDK\bin;C:\cygwin\bin
+ 
+ If you run bash in Start-RUN-cmd.exe it should work. 
+ 
+ Then you can follow the rest of these steps.
  
  === Install Nutch ===
   * Grab a fresh release of Nutch 0.9 - 
http://lucene.apache.org/nutch/version_control.html
   * Do not build Nutch now. Make sure you have no .project and .classpath 
files in the Nutch directory
+ 
  
  === Create a new java project in Eclipse ===
   * File > New > Project > Java project > click Next


[Nutch Wiki] Update of NewScoringIndexingExample by DennisKubes

2009-03-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoringIndexingExample

New page:
= Example Running new Scoring and Indexing Systems =

Below is an example of running the new scoring and indexing systems from start 
to finish.  This was done with a sample of 1000 urls and I ran two different 
fetch cycles.  The first being 1000 urls and the second being the top 2000 
urls.  The loops job is optional but included for completeness.  In production 
we have actually removed that job.  This was done with a clean pull from Nutch 
trunk as of 2009-03-06 (right before 1.0 is set to be released).  If anybody 
has any problems running these commands or has questions send me an email or 
send one to the nutch users or dev list and I will reply.  Please send it to 
kubes at the apache address dot org.

{{{
bin/nutch inject crawl/crawldb crawl/urls/
bin/nutch generate crawl/crawldb/ crawl/segments
bin/nutch fetch crawl/segments/20090306093949/
bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306093949/
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb 
crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb 
-webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -scores -topn 1000 
-webgraphdb crawl/webgraphdb/ -output crawl/webgraphdb/dump/scores


more crawl/webgraphdb/dump/scores/part-0

http://validator.w3.org/check?uri=referer   0.4955311
http://www.adobe.com/go/getflashplayer  0.4060498
http://www.statcounter.com/ 0.4060498
http://www.liveinternet.ru/click    0.33680826
http://www.adobe.com/products/acrobat/readstep2.html    0.31656843
http://www.adobe.com/go/getflashplayer/ 0.30378538
http://www.bloomingbows.com/2003/scripts/sitemap.asp    0.27821928
http://www.misterping.com/  0.27821928
...



bin/nutch readdb crawl/crawldb/ -stats

CrawlDb statistics start: crawl/crawldb/
Use GenericOptionsParser for parsing the arguments. Applications should 
implement Tool for the same.
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 16711
retry 0:16686
retry 1:25
min score:  0.0
avg score:  0.022716654
max score:  0.495
status 1 (db_unfetched):15739
status 2 (db_fetched):  677
status 3 (db_gone): 75
status 4 (db_redir_temp):   143
status 5 (db_redir_perm):   77
CrawlDb statistics: done



bin/nutch generate crawl/crawldb/ crawl/segments/ -topN 2000
bin/nutch fetch crawl/segments/20090306100055/
bin/nutch updatedb crawl/crawldb/ crawl/segments/20090306100055/
rm -fr crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ 
-webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb 
crawl/webgraphdb/
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb 
-webgraphdb crawl/webgraphdb/

more crawl/webgraphdb/dump/scores/part-0

http://www.statcounter.com/ 1.7133079
http://www.morristownwebdesign.com/ 1.0093393
http://www.jdoqocy.com/click-3331968-10419685   0.87828785
http://www.anrdoezrs.net/click-3331968-10384568 0.87828785
http://www.sedo.com/main.php3?language=e    0.6565905
http://wetter.spiegel.de/spiegel/html/frankreich0.html  0.641775
http://www.kenwood.com/ 0.6084726
http://validator.w3.org/check?uri=referer   0.5605916
http://wetter.spiegel.de/spiegel/html/Italien0.html 0.5164927
http://www.youtube.com/?hl=en&tab=w1    0.50952965
http://www.addthis.com/bookmark.php 0.5013165
http://www.ptguide.com/ 0.49564213
http://www.adobe.com/go/getflashplayer  0.47368217
http://de.weather.yahoo.com/ITXX/ITXX0073/index_c.html  0.4657473
http://www.adobe.com/shockwave/download/download.cgi?P1_Prod_Version=ShockwaveFlash&promoid=BIOW    0.44376293
http://www.google.com/  0.42282072
http://www.zajezdy.cz/  0.41620353
http://www.intermarche.com/ 0.41489196
http://www.shipskill.com/7/ 0.4147887
http://www.statcounter.com/free_hit_counter.html    0.40928197

[Nutch Wiki] Update of FrontPage by DennisKubes

2009-03-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/FrontPage

The comment on the change is:
Added example for new scoring and indexing systems

--
  == Nutch 2.0 ==
   * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.
   * [NewScoring] -- New stable pagerank like webgraph and link-analysis jobs.
+  * [NewScoringIndexingExample] -- Two full fetch cycles of commands using 
new scoring and indexing systems.
  
  == Other Resources ==
   * [http://nutch.sourceforge.net/blog/cutting.html Doug's Weblog] -- He's the 
one who originally wrote Lucene and Nutch.


[Nutch Wiki] Update of DownloadingNutch by BartoszGadzimski

2009-02-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/DownloadingNutch

--
  You have two choices in how to get Nutch:
-   1. You can download a release from http://lucene.apache.org/nutch/release/. 
 This will give you a relatively stable release.  At the moment the latest 
release is 0.8.
+   1. You can download a release from http://lucene.apache.org/nutch/release/. 
 This will give you a relatively stable release.  At the moment the latest 
release is 0.9.
2. Or, you can check out the latest source code from subversion and build 
it with Ant.  This gets you closer to the bleeding edge of development.  The 
0.9 should be relatively stable but the trunk (from which the 
[http://lucene.apache.org/nutch/nightly.html nightly builds] are build) is 
under heavy development with bugs showing up and getting squashed fairly 
frequently. 
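
A minimal sketch of the subversion route (the trunk URL below reflects Nutch's location as a Lucene subproject in this era; verify it against the version control page):
{{{
svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch
cd nutch
ant
}}}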
  
  Note: As of 5/29/08 the Subversion trunk seems to be much better than the 0.9 
release. If you have trouble with 0.9 your best bet is to try moving to trunk 
and see if the problems resolve themselves.


[Nutch Wiki] Update of SimpleMapReduceTutorial by BartoszGadzimski

2009-02-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial

The comment on the change is:
It is not a map reduce tutorial, it's only confusing people

--
- This is the simplest map reduce example I could come up with. Local 
filesystem, just getting one segment indexed. I am running Ubuntu, on an Athlon 
3200+ using a cable modem connection.
+ deleted
  
- == Designate Url ==
- 
- Need to get to the right place
- 
- {{{
- cd nutch/branches/mapred
- }}}
- 
- We need to make a directory that contains files, where each line of each file 
is a url. I choose http://lucene.apache.org/nutch/
- 
- {{{
- mkdir urls
- echo "http://lucene.apache.org/nutch/" > urls/urls
- }}}
- 
- Also need to change the crawl filter to include this site
- 
- {{{
- perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' 
conf/crawl-urlfilter.txt
- }}}
- 
- We walk through the following steps: crawl, generate, fetch, updatedb, 
invertlinks, index.
- 
- == Crawl ==
- 
- We want to run crawl on the urls directory from above.
- 
- {{{
- ./bin/nutch crawl urls
- }}}
- 
- Took me about ten minutes. Output included
- 
- 051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s
- 
- The errors generally seemed to be timeouts.
- 
- The rest of the commands are a bit more dynamic, relying on timestamp and the 
like. Environment variables help out.
- 
- == Generate ==
- 
- Here we walk a segment dir from the crawl above.
- 
- {{{
- CRAWLDB=`find crawl-2* -name crawldb`
- SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
- ./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
- }}}
- 
- Took less than five seconds.
- 
- == Fetch ==
- 
- {{{
- SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
- ./bin/nutch fetch $SEGMENT
- }}}
- 
- Took about seven minutes, and output looked like
- 
- 051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s,
- 
- Again, many timeouts.
- 
- == UpdateDB ==
- 
- {{{
- ./bin/nutch updatedb $CRAWLDB $SEGMENT
- }}}
- 
- Took less than five seconds.
- 
- == InvertLinks ==
- 
- {{{
- LINKDB=`find crawl-2* -name linkdb -maxdepth 1`
- SEGMENTS=`find crawl-2* -name segments -maxdepth 1`
- ./bin/nutch invertlinks $LINKDB $SEGMENTS
- }}}
- 
- Took less than five seconds.
- 
- == Index ==
- 
- We need a place for our index, say myindex
- 
- {{{
- mkdir myindex
- }}}
- 
- Now, let's index.
- 
- {{{
- ./bin/nutch index myindex $LINKDB $SEGMENT
- }}}
- 
- Took less than ten seconds.
- 
- == Test ==
- 
- The best test I have for the moment is
- 
- {{{
- ls -alR myindex
- }}}
- 
- If you see several files, it at least did something. Happy nutching!
- 
- Tutorial written by Earl Cahill, 2005.
- 


[Nutch Wiki] Trivial Update of FrontPage by BartoszGadzimski

2009-02-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/FrontPage

--
   * GettingNutchRunningWithDebian
   * GettingNutchRunningWithSocksProxy
   * ErrorMessages -- What they mean and suggestions for getting rid of them.
-  * SimpleMapReduceTutorial
   * SetupProxyForNutch - using Tinyproxy on Ubuntu
   * CreateNewFilter - for example to add a category metadata to your index and 
be able to search for it
   * UpgradeFrom07To08


[Nutch Wiki] Update of InstallingWeb2 by SamiSiren

2009-02-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by SamiSiren:
http://wiki.apache.org/nutch/InstallingWeb2

--
+ == NOTE: Web2 module is no longer part of Nutch ==
+ So these instructions no longer apply.
+ 
+ 
+ 
  chris sleeman wrote:
   Hi,
   


[Nutch Wiki] Update of RunNutchInEclipse0.9 by FrankMcCown

2009-02-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
Corrected instruction

--
  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
  
  Copy the jar files into src/plugin/parse-mp3/lib and 
src/plugin/parse-rtf/lib/ respectively.
+ Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... > Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
- Then add the jar files to the build path (First refresh the workspace. Then right-click on the source folder > Java Build Path > Libraries > Add Jars. In Eclipse version 3.4, right-click the project folder > Build Path > Configure Build Path... > Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
  
  
  === Build Nutch ===


[Nutch Wiki] Update of IntranetRecrawl by SAnand

2009-02-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by SAnand:
http://wiki.apache.org/nutch/IntranetRecrawl

The comment on the change is:
Suggested fix for index/merge-output already exists error when merging indices

--
  
  No changes necessary for this to run with Nutch 0.9.0.
  
+ However, if you get an error message indicating that the folder 
index/merge-output already exists, move the index/merge-output folder back 
into the index/ folder. For example:
+ {{{
+ mv $index_dir/merge-output /tmp
+ rm -rf $index_dir
+ mv /tmp/merge-output $index_dir
+ }}}
  === Code ===
  
  {{{


[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by FrankMcCown

2009-02-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
clarified

--
  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
  
  Copy the jar files into src/plugin/parse-mp3/lib and 
src/plugin/parse-rtf/lib/ respectively.
- Then add them to the libraries to the build path (First refresh the 
workspace. Then right-click on the source
+ Then add the jar files to the build path (First refresh the workspace. Then 
right-click on the source
  folder > Java Build Path > Libraries > Add Jars. In Eclipse version 3.4, right-click the project folder > Build Path > Configure Build Path... > Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
  
  


[Nutch Wiki] Update of GettingNutchRunningWithWindows by FrankMcCown

2009-02-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Added some clarifications

--
  
  === Download ===
  
- [http://lucene.apache.org/nutch/release/ Download] the release and extract 
anywhere on your hard disk e.g. `c:\nutch-0.9`
+ [http://lucene.apache.org/nutch/release/ Download] the release and extract on 
your hard disk in a directory that ''does not'' contain a space in it (e.g., 
`c:\nutch-0.9`).  If the directory does contain a space (e.g., `c:\my 
programs\nutch-0.9`), the Nutch scripts will not work properly.
  
- Create an empty text file in your nutch directory e.g. `urls` and add the 
URLs of the sites you want to crawl.
+ Create an empty text file (use any name you wish) in your nutch directory 
(e.g., `urls`) and add the URLs of the sites you want to crawl.
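
For example, a minimal one-line seed file (the URL is just an illustration):
{{{
http://www.apache.org/
}}}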
  
- Add your URLs to the `crawl-urlfilter.txt` (e.g. 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
+ Add your URLs to the `crawl-urlfilter.txt` (e.g., 
`C:\nutch-0.9\conf\crawl-urlfilter.txt`). An entry could look like this:
  {{{
  +^http://([a-z0-9]*\.)*apache.org/
  }}}
  
- Load up cygwin and naviagte to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
+ Load up cygwin and navigate to your `nutch` directory.  When cygwin launches, 
you'll usually find yourself in your user folder (e.g. `C:\Documents and 
Settings\username`).
  
- If your workstation needs to go through a windows authentication proxy to get 
to the internet then you can use an application such as the 
[http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] to 
get through it.  You'll then need to edit the `nutch-site.xml` file to point to 
the port opened by the app.
+ If your workstation needs to go through a Windows Authentication Proxy to get 
to the Internet (this is not common), then you can use an application such as 
the [http://sourceforge.net/projects/ntlmaps/ NTLM Authorization Proxy Server] 
to get through it.  You'll then need to edit the `nutch-site.xml` file to point 
to the port opened by the app.
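
A sketch of the relevant nutch-site.xml entries ('http.proxy.host' and 'http.proxy.port' are standard Nutch properties; the host and port values shown are hypothetical):
{{{
<property>
  <name>http.proxy.host</name>
  <value>localhost</value>
</property>

<property>
  <name>http.proxy.port</name>
  <value>5865</value>
</property>
}}}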
  
  == Intranet Crawling ==
  
@@ -48, +48 @@

  {{{
  bin/nutch crawl urls -dir crawl -depth 3 > crawl.log
  }}}
- then a folder called crawl/ is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
+ then a folder called `crawl` is created in your `nutch` directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.
  
  You'll need to delete or move the crawl directory before starting the crawl 
off again unless you specify another path on the command above.
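
For example, from the cygwin shell (a sketch; adjust names and paths as you like):
{{{
mv crawl crawl.old        # keep the previous crawl out of the way
# or discard it entirely:
rm -rf crawl
}}}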
  


[Nutch Wiki] Update of RunNutchInEclipse0.9 by FrankMcCown

2009-02-10 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

The comment on the change is:
Clarified some instructions and improved grammar

--
   * Nutch release 0.9
   * Eclipse 3.3 - aka Europa
   * Java 1.6
-  * Ubuntu (should work on most platform, though)
+  * Ubuntu (should work on most platforms though)
  
  == Before you start ==
  
@@ -34, +34 @@

  
  
  === Configure Nutch ===
-  * see the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
+  * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
-  * change the property plugin.folders to ./src/plugin on 
$NUTCH_HOME/conf/nutch-defaul.xml 
+  * Change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml 
-  * make sure Nutch is configured correctly before testing it into Eclipse ;-)
+  * Make sure Nutch is configured correctly before testing it into Eclipse ;-)
  
- === missing org.farng and com.etranslate ===
+ === Missing org.farng and com.etranslate ===
- You will encounter problems with some imports in parse-mp3 and parse-rtf 
plugins (30 errors in my case).
+ Eclipse will complain about some import statements in parse-mp3 and parse-rtf 
plugins (30 errors in my case).
- Because of incompatibility with Apache license they were left from sources. 
+ Because of incompatibility with the Apache license, the .jar files that 
define the necessary classes were not included with the source code. 
+ 
- You can download them here:
+ Download them here:
  
  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
  
  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
  
  Copy the jar files into src/plugin/parse-mp3/lib and 
src/plugin/parse-rtf/lib/ respectively.
- Then add them to the libraries to the build path (First refresh the 
workspace. Then Right click on the source
+ Then add them to the libraries to the build path (First refresh the 
workspace. Then right-click on the source
- folder => Java Build Path => Libraries => Add Jars).
+ folder > Java Build Path > Libraries > Add Jars. In Eclipse version 3.4, right-click the project folder > Build Path > Configure Build Path... > Then select the Libraries tab, click Add Jars... and then add each .jar file individually).
  
  
  === Build Nutch ===
-  * In case you setup the project correctly, Eclipse will build Nutch for you 
into tmp_build.
+ If you setup the project correctly, Eclipse will build Nutch for you into 
tmp_build. See below for problems you could run into.
- 
- 
  
  
  


[Nutch Wiki] Update of Mailing by GrantIngersoll

2009-01-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/nutch/Mailing

--
  
  == List Archives ==
  
+ [http://www.lucidimagination.com/search] - Search the Lucene ecosystem, 
including Nutch.  Powered by Lucene/Solr.
[http://www.mail-archive.com/index.php?hunt=nutch Searchable Nutch] list 
archives.
  [http://www.nabble.com/Nutch-f362.html nutch archives] nabble.com archives.
  


[Nutch Wiki] Update of Mailing by GrantIngersoll

2009-01-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/nutch/Mailing

--
  
  == List Archives ==
  
- [http://www.lucidimagination.com/search] - Search the Lucene ecosystem, 
including Nutch.  Powered by Lucene/Solr.
+  * [http://www.lucidimagination.com/search] - Search the Lucene ecosystem, 
including Nutch.  Powered by Lucene/Solr.
- [http://www.mail-archive.com/index.php?hunt=nutch Searchable Nutch] list archives.
+  * [http://www.mail-archive.com/index.php?hunt=nutch Searchable Nutch] list archives.
- [http://www.nabble.com/Nutch-f362.html nutch archives] nabble.com archives.
+  * [http://www.nabble.com/Nutch-f362.html nutch archives] nabble.com archives.
  


[Nutch Wiki] Update of NewScoring by OtisGospodnetic

2009-01-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/NewScoring

--
- This page describes the new scoring (i.e. WebGraph and Link Analysis) 
functionality in Nutch as of revision 723441.
+ This page describes the new scoring (i.e. !WebGraph and Link Analysis) 
functionality in Nutch as of revision 723441.
  
  == General Information ==
  The new scoring functionality can be found in 
org.apache.nutch.scoring.webgraph.  This package contains multiple programs 
that build web graphs, perform a stable convergent link-analysis, and update 
the crawldb with those scores.  These programs assume that fetching cycles have 
already been completed and now the users want to build a global webgraph from 
those segments and from that webgraph perform link-analysis to get a single 
global relevancy score for each url.  Building a webgraph assumes that all 
links are stored in the current segments to be processed.  Links are not held 
over from one processing cycle to another.  Global link-analysis scores are 
based on the current links available and scores will change as the link 
structure of the webgraph changes.
@@ -8, +8 @@

  Currently the scoring jobs are not integrated into the Nutch script as 
commands and must be run in the form bin/nutch 
org.apache.nutch.scoring.webgraph..
  
  === WebGraph ===
- The WebGraph program is the first job that must be run once all segments are 
fetched and ready to be processed.  WebGraph is found at 
org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the programs 
usage.
+ The !WebGraph program is the first job that must be run once all segments are 
fetched and ready to be processed.  !WebGraph is found at 
org.apache.nutch.scoring.webgraph.!WebGraph. Below is a printout of the 
program's usage.
  
  {{{
  usage: WebGraph
@@ -17, +17 @@

   -webgraphdb <webgraphdb>   the web graph database to use
  }}}
  
- The WebGraph program can take multiple segments to process and requires an 
output directory in which to place the completed web graph components.  The 
WebGraph creates three different components, and inlink database, an outlink 
database, and a node database.  The inlink database is a listing of url and all 
of its inlinks.  The outlink database is a listing of url and all of its 
outlinks.  The node database is a listing of url with node meta information 
including the number of inlinks and outlinks, and eventually the score for that 
node.
+ The !WebGraph program can take multiple segments to process and requires an 
output directory in which to place the completed web graph components.  The 
!WebGraph creates three different components: an inlink database, an outlink 
database, and a node database.  The inlink database is a listing of url and all 
of its inlinks.  The outlink database is a listing of url and all of its 
outlinks.  The node database is a listing of url with node meta information 
including the number of inlinks and outlinks, and eventually the score for that 
node.
  
  === Loops ===
- Once the web graph is built we can begin the process of link analysis.  Loops 
is an optional program that attempts to help weed out spam sites by determining 
link cycles in a web graph.  An example of a link cycle would be sites A, B, C, 
and D where A links to B which links to C which links to D which links back to 
A.  This program is computationally expensive and usually, due to time and 
space requirement, can't be run on more than a three or four level depth.  
While it does identify sites which appear to be spam and those links are then 
discounted in the later LinkRank program, its benefit to cost ratio is very 
low.  It is included in this package for completeness and because their may be 
a better way to perform this function with a different algorithm.  But on 
current production webgraphs, its use is discouraged.  Loops is found at 
org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the programs 
usage.
+ Once the web graph is built we can begin the process of link analysis.  Loops 
is an optional program that attempts to help weed out spam sites by determining 
link cycles in a web graph.  An example of a link cycle would be sites A, B, C, 
and D, where A links to B which links to C which links to D which links back to 
A.  This program is computationally expensive and usually, due to time and 
space requirement, can't be run on more than a three or four level depth.  
While it does identify sites which appear to be spam and those links are then 
discounted in the later !LinkRank program, its benefit to cost ratio is very 
low.  It is included in this package for completeness and because there may be 
a better way to perform this function with a different algorithm.  But on 
current large production webgraphs, its use is discouraged.

[Nutch Wiki] Update of NewPage by DennisKubes

2009-01-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewPage

The comment on the change is:
Beginning descriptions of how to use the new webgraph scoring system.

--
- emptyemptyempty!
+ This page describes the new scoring (i.e. WebGraph and Link Analysis) 
functionality in Nutch as of revision 723441.
  
+ == General Information ==
+ The new scoring functionality can be found in 
org.apache.nutch.scoring.webgraph.  This package contains multiple programs 
that build web graphs, perform a stable convergent link-analysis, and update 
the crawldb with those scores.  These programs assume that fetching cycles have 
already been completed and now the users want to build a global webgraph from 
those segments and from that webgraph perform link-analysis to get a single 
global relevancy score for each url.  Building a webgraph assumes that all 
links are stored in the current segments to be processed.  Links are not held 
over from one processing cycle to another.  Global link-analysis scores are 
based on the current links available and scores will change as the link 
structure of the webgraph changes.
+ 
+ Currently the scoring jobs are not integrated into the Nutch script as 
commands and must be run in the form bin/nutch 
org.apache.nutch.scoring.webgraph..
+ 
+ === WebGraph ===
+ The WebGraph program is the first job that must be run once all segments are 
fetched and ready to be processed.  WebGraph is found at 
org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the program's usage.
+ 
+ {{{
+ usage: WebGraph
+  -help  show this help message
+  -segment <segment>       the segment(s) to use
+  -webgraphdb <webgraphdb>   the web graph database to use
+ }}}
+ 
+ The WebGraph program can take multiple segments to process and requires an 
output directory in which to place the completed web graph components.  The 
WebGraph creates three different components: an inlink database, an outlink 
database, and a node database.  The inlink database is a listing of url and all 
of its inlinks.  The outlink database is a listing of url and all of its 
outlinks.  The node database is a listing of url with node meta information 
including the number of inlinks and outlinks, and eventually the score for that 
node.
+ 
+ === Loops ===
+ Once the web graph is built we can begin the process of link analysis.  Loops 
is an optional program that attempts to help weed out spam sites by determining 
link cycles in a web graph.  An example of a link cycle would be sites A, B, C, 
and D where A links to B which links to C which links to D which links back to 
A.  This program is computationally expensive and usually, due to time and 
space requirement, can't be run on more than a three or four level depth.  
While it does identify sites which appear to be spam and those links are then 
discounted in the later LinkRank program, its benefit to cost ratio is very 
low.  It is included in this package for completeness and because there may be 
a better way to perform this function with a different algorithm.  But on 
current production webgraphs, its use is discouraged.  Loops is found at 
org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the program's usage.
+ 
+ {{{
+ usage: Loops
+  -help  show this help message
+  -webgraphdb <webgraphdb>   the web graph database to use
+ }}}
+ 
+ === LinkRank ===
+ With the web graph built we can now run LinkRank to perform an iterative link 
analysis.  LinkRank is a PageRank like link analysis program that converges to 
stable global scores for each url.  Similar to PageRank, the LinkRank program 
starts with a common score for all urls.  It then creates a global score for 
each url based on the number of incoming links and the scores for those links 
and the number of outgoing links from the page.  The process is iterative and 
scores tend to converge after a given number of iterations.  It is different 
from PageRank in that nepotistic links such as links internal to a website and 
reciprocal links between websites can be ignored.  The number of iterations can 
also be configured, by default 10 iterations are performed.  Unlike the 
previous OPIC scoring, the LinkRank program does not keep scores from one 
processing time to another.  The web graph and the link scores are recreated at 
each processing run and so we don't have the problems of ever-increasing scores.  LinkRank requires the WebGraph program to have 
completed successfully and it stores its output scores for each url in the node 
database of the webgraph. LinkRank is found at 
org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the program's usage.  
+ 
+ {{{
+ usage: LinkRank
+  -help  show this help message
+  

[Nutch Wiki] Update of NewScoring by DennisKubes

2009-01-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoring

New page:
This page describes the new scoring (i.e. WebGraph and Link Analysis) 
functionality in Nutch as of revision 723441.

== General Information ==
The new scoring functionality can be found in 
org.apache.nutch.scoring.webgraph.  This package contains multiple programs 
that build web graphs, perform a stable convergent link-analysis, and update 
the crawldb with those scores.  These programs assume that fetching cycles have 
already been completed and now the users want to build a global webgraph from 
those segments and from that webgraph perform link-analysis to get a single 
global relevancy score for each url.  Building a webgraph assumes that all 
links are stored in the current segments to be processed.  Links are not held 
over from one processing cycle to another.  Global link-analysis scores are 
based on the current links available and scores will change as the link 
structure of the webgraph changes.

Currently the scoring jobs are not integrated into the Nutch script as commands 
and must be run in the form bin/nutch org.apache.nutch.scoring.webgraph..
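
For example, a full run over one freshly fetched segment might look like the 
following sketch (the paths are illustrative; the class names and options come 
from the usage printouts below):
{{{
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
  -webgraphdb crawl/webgraphdb -segment crawl/segments/20090112123456
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank \
  -webgraphdb crawl/webgraphdb
}}}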

=== WebGraph ===
The WebGraph program is the first job that must be run once all segments are 
fetched and ready to be processed.  WebGraph is found at 
org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the program's 
usage.

{{{
usage: WebGraph
 -help  show this help message
 -segment segment the segment(s) to use
 -webgraphdb webgraphdb   the web graph database to use
}}}

The WebGraph program can take multiple segments to process and requires an 
output directory in which to place the completed web graph components.  The 
WebGraph creates three different components: an inlink database, an outlink 
database, and a node database.  The inlink database is a listing of each url and 
all of its inlinks.  The outlink database is a listing of each url and all of 
its outlinks.  The node database is a listing of each url with node meta 
information, including the number of inlinks and outlinks, and eventually the 
score for that node.
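
After a successful run, the webgraphdb output directory holds the three 
components as subdirectories, roughly like this sketch (the directory names 
shown here are assumptions and may differ between versions):
{{{
crawl/webgraphdb/
  inlinks/    <- the inlink database
  outlinks/   <- the outlink database
  nodes/      <- the node database
}}}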

=== Loops ===
Once the web graph is built we can begin the process of link analysis.  Loops 
is an optional program that attempts to help weed out spam sites by detecting 
link cycles in a web graph.  An example of a link cycle would be sites A, B, C, 
and D, where A links to B, which links to C, which links to D, which links back 
to A.  This program is computationally expensive and usually, due to its time 
and space requirements, can't be run at more than a three or four level depth.  
While it does identify sites which appear to be spam, and those links are then 
discounted in the later LinkRank program, its benefit to cost ratio is very 
low.  It is included in this package for completeness and because there may be 
a better way to perform this function with a different algorithm.  But on 
current production webgraphs, its use is discouraged.  Loops is found at 
org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the program's 
usage.

{{{
usage: Loops
 -help  show this help message
 -webgraphdb webgraphdb   the web graph database to use
}}}
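
If you do choose to run Loops despite the cost, it is a single command over an 
existing webgraphdb (the path is illustrative):
{{{
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb
}}}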

=== LinkRank ===
With the web graph built we can now run LinkRank to perform an iterative link 
analysis.  LinkRank is a PageRank-like link analysis program that converges to 
stable global scores for each url.  Similar to PageRank, the LinkRank program 
starts with a common score for all urls.  It then creates a global score for 
each url based on the number of incoming links, the scores of those links, and 
the number of outgoing links from the page.  The process is iterative and 
scores tend to converge after a given number of iterations.  It is different 
from PageRank in that nepotistic links, such as links internal to a website and 
reciprocal links between websites, can be ignored.  The number of iterations 
can also be configured; by default, 10 iterations are performed.  Unlike the 
previous OPIC scoring, the LinkRank program does not keep scores from one 
processing run to another.  The web graph and the link scores are recreated on 
each processing run, so we don't have the problem of ever-increasing scores.  
LinkRank requires the WebGraph program to have completed successfully, and it 
stores its output scores for each url in the node database of the webgraph. 
LinkRank is found at org.apache.nutch.scoring.webgraph.LinkRank. Below is a 
printout of the program's usage.

{{{
usage: LinkRank
 -help  show this help message
 -webgraphdb webgraphdb   the web graph db to use
}}}
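
The iteration count is read from the Nutch configuration; a sketch for 
nutch-site.xml follows (the property name here is an assumption, so verify it 
against nutch-default.xml for your version):
{{{
<!-- assumed property name; raise the value if scores have not converged -->
<property>
  <name>link.analyze.num.iterations</name>
  <value>10</value>
</property>
}}}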

=== ScoreUpdater ===
Once the LinkRank program has been run and link analysis is completed, the 
scores must be updated into the crawl database to work with the current Nutch 
functionality.  The 

[Nutch Wiki] Update of NewPage by DennisKubes

2009-01-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewPage

--
- This page describes the new scoring (i.e. WebGraph and Link Analysis) 
functionality in Nutch as of revision 723441.
+ empty
  

[Nutch Wiki] Update of FrontPage by DennisKubes

2009-01-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/FrontPage

--
  
  == Nutch 2.0 ==
   * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.
+  * [NewScoring] -- New stable PageRank-like webgraph and link-analysis jobs.
  
  == Other Resources ==
   * [http://nutch.sourceforge.net/blog/cutting.html Doug's Weblog] -- He's the 
one who originally wrote Lucene and Nutch.


[Nutch Wiki] Trivial Update of Release HOWTO by SamiSiren

2009-01-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by SamiSiren:
http://wiki.apache.org/nutch/Release_HOWTO

The comment on the change is:
remember to update doap.rdf

--
1. Copy release tar file to 
{{{people.apache.org:/www/www.apache.org/dist/lucene/nutch}}}.
  
1. Wait 24 hours for release to propagate to mirrors.
+ 
+ 1. Add the new release info to the 
[https://svn.apache.org/repos/asf/lucene/nutch/trunk/site/doap.rdf doap.rdf] 
file, and double-check for any other updates the doap file may need if it 
hasn't been updated in a while (see the sketch after this list). 
+ 
1. Deploy new Nutch site (according to [Website Update HOWTO]).
1. Deploy new main Lucene site (according to [Website Update HOWTO] 
but modified for Lucene site - update is to be performed in 
{{{/www/lucene.apache.org}}} directory).
1. Update Javadoc in 
{{{people.apache.org:/www/lucene.apache.org/nutch/apidocs}}}.
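+ 
+ For reference, a DOAP release stanza typically looks roughly like the 
following sketch (the values are placeholders; follow the existing entries in 
doap.rdf):
+ {{{
+ <release>
+   <Version>
+     <name>Apache Nutch X.Y</name>
+     <created>YYYY-MM-DD</created>
+     <revision>X.Y</revision>
+   </Version>
+ </release>
+ }}}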


[Nutch Wiki] Trivial Update of HttpPostAuthentication by susam

2008-12-05 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpPostAuthentication

--
  == Introduction ==
- Often, Nutch has to crawl websites with pages protected by authentication. 
Therefore, to crawl such web-pages, Nutch must authenticate itself to the 
website and then proceed with fetching the pages from it. Currently, the 
development version of Nutch can do Basic, Digest and NTLM] based 
authentication. This is documented in HttpAuthenticationSchemes. In this 
project, we would be adding HTTP POST based authentication, which is the most 
popular form of authentication on most websites. It should be possible to 
configure different credentials for different websites.
+ Often, Nutch has to crawl websites with pages protected by authentication. 
Therefore, to crawl such web pages, Nutch must authenticate itself to the 
website and then proceed with fetching the pages from it. Currently, the 
development version of Nutch can do Basic, Digest and NTLM based 
authentication. This is documented in HttpAuthenticationSchemes. In this 
project, we would be adding HTTP POST based authentication, which is the most 
popular form of authentication on most websites. It should be possible to 
configure different credentials for different websites.
  
  == Configuration ==
  A configuration file lists the domains for which authentication should be 
done, along with the login URL and POST data. If possible, the configuration 
should also allow the user to specify a session timeout value for each website 
as an optional parameter. This would be helpful if some website is known to 
time out very quickly, or when the duration of the fetch cycle would be too 
long compared to the session's life.
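 
  Since this page describes a proposal, the exact file format was not fixed; 
here is a sketch of what such a configuration might look like (all element 
names here are hypothetical):
  {{{
  <auth-configuration>
    <credentials domain="www.example.com">
      <loginUrl>http://www.example.com/login.do</loginUrl>
      <postData>username=crawler&amp;password=secret</postData>
      <!-- optional, hypothetical: session timeout in seconds -->
      <sessionTimeout>1800</sessionTimeout>
    </credentials>
  </auth-configuration>
  }}}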
@@ -23, +23 @@

  
   1. We use pattern matching to find out whether the contents of the page 
indicates it as an authentication failure page or not, for the website. But it 
is an unnecessary waste of time because for most cases the page wouldn't be an 
error page.
   1. We perform an authentication by sending POST data to login URL every time 
we fetch a page from that domain. By this, we are almost doubling the bandwidth 
requirement to crawl that website.
-  1. For those sites, where authentication failure page comes from a known 
URL, we can add which URLs mean authentication failure along with the login URL 
and POST data in the configuration file. There wouldn't be too many such URLs 
for a particular domain and so a regex match or a complete string match for the 
URLs after every response
+  1. For those sites, where authentication failure page comes from a known 
URL, we can add which URLs mean authentication failure along with the login URL 
and POST data in the configuration file. There wouldn't be too many such URLs 
for a particular domain and so a regex match or a complete string match for the 
URLs after every response from that domain shouldn't consume much time.
- from that domain shouldn't consume much time.
  
  However, even without taking care of these points, and simply getting the 
fetcher behavior right as discussed in the previous section, we'll have a 
solution that may be useful to many.
  


[Nutch Wiki] Update of RunNutchInEclipse0.9 by PiotrBazan

2008-12-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by PiotrBazan:
http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9

--
   * Name the project (Nutch_Trunk for instance)
   * Select Create project from existing source and use the location where 
you downloaded Nutch
   * Click on Next, and wait while Eclipse is scanning the folders
-  * Add the folder conf to the classpath (third tab and then add class 
folder)
+  * Add the folder conf to the classpath (third tab and then add class 
folder) 
+  * Go to the Order and Export tab, find the entry for the added conf folder 
and move it to the top. This is required so that Eclipse takes the 
configuration resources (nutch-default.xml, nutch-site.xml, etc.) from our 
conf folder and not from anywhere else.
   * Eclipse should have guessed all the java files that must be added on your 
classpath. If it's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries 
   * Set output dir to tmp_build, create it if necessary
   * DO NOT add build to classpath
@@ -34, +35 @@

  
  === Configure Nutch ===
   * see the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
-  * change the property plugin.folders to ./src/plugin in 
$NUTCH_HOME/conf/nutch-default.xml
+  * change the property plugin.folders to ./src/plugin in 
$NUTCH_HOME/conf/nutch-default.xml (see the snippet after this list) 
   * make sure Nutch is configured correctly before testing it in Eclipse ;-)
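 
  The resulting entry in the configuration file looks like this (the property 
name and value are taken from the step above):
  {{{
  <property>
    <name>plugin.folders</name>
    <value>./src/plugin</value>
  </property>
  }}}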
  
  === missing org.farng and com.etranslate ===


[Nutch Wiki] Update of johnroman by johnroman

2008-12-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by johnroman:
http://wiki.apache.org/nutch/johnroman

--
- John Roman is a sysadmin for the RD arm of lexmark international.  
+ [http://nimbius.36bit.com/mered.jpg John Roman] is a sysadmin for the R&D arm 
of Lexmark International.
  Some of his contributions include bugfix documentation and 
troubleshooting, as well as an attempt to clean up a lot of the tutorials.
  


[Nutch Wiki] Update of PluginCentral by johnroman

2008-11-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by johnroman:
http://wiki.apache.org/nutch/PluginCentral

--
   * WritingPluginExample - A step-by-step example of how to write a plugin for 
the 0.7 branch. - updated by LucasBoullosa
   * [http://wiki.media-style.com/display/nutchDocu/Write+a+plugin Writing 
Plugins] - by Stefan
  
- == Plugins that Come with Nutch (0.7) ==
+ == Plugins that Come with Nutch (0.9) ==
  
  In order to get Nutch to use any of these plugins, you just need to edit your 
conf/nutch-site.xml file and add the name of the plugin to the list of 
plugin.includes.
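 
  For example, to enable the PDF parser you would extend the plugin.includes 
regular expression in conf/nutch-site.xml roughly like this sketch (the value 
shown is illustrative; start from the default in conf/nutch-default.xml):
  {{{
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
  }}}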
  
@@ -24, +24 @@

   * '''parse-html''' - Parses HTML documents
   * '''parse-js''' - Parses Java``Script
   * '''parse-mp3''' - Parses MP3s
+  * '''parse-zip''' - Parses ZIP archives
+  * '''parse-mspowerpoint''' - Parses Microsoft Powerpoint files
   * '''parse-msword''' - Parses MS Word documents
+  * '''parse-msexcel''' - Parses MS Excel documents
   * '''parse-pdf''' - Parses PDFs
   * '''parse-rss''' - Parses RSS feeds
+  * '''parse-oo''' - Parses OpenOffice files
+  * '''parse-swf''' - Parses Shockwave Flash
   * '''parse-rtf''' - Parses RTF files
   * '''parse-text''' - Parses text documents
   * '''protocol-file''' - Retrieves documents from the filesystem
@@ -47, +52 @@

   * '''lib-commons-httpclient'''
   * '''lib-http'''
   * '''lib-jakarta-poi'''
-  * '''lib-log4j'''
+  * '''lib-log4j''' 
-  * '''lib-lucene-analyzers'''
+  * '''lib-lucene-analyzers''' - Lucene analyzers
-  * '''lib-nekohtml'''
-  * '''lib-parsems'''
+  * '''lib-nekohtml''' - automatic HTML tag balancer 
+  * '''lib-parsems''' - framework for parsing MS document formats
   * '''parse-msexcel''' - Parses MS Excel documents
   * '''parse-mspowerpoint''' - Parses MS Powerpoint documents
   * '''parse-oo''' - Parses Open Office and Star Office documents 
(Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, 
STI)


[Nutch Wiki] Update of johnroman by johnroman

2008-11-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by johnroman:
http://wiki.apache.org/nutch/johnroman

New page:
John Roman is a sysadmin for the R&D arm of Lexmark International.  
Some of his contributions include bugfix documentation and troubleshooting, as 
well as an attempt to clean up a lot of the tutorials.


[Nutch Wiki] Update of Support by ThomasDelnoij

2008-11-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by ThomasDelnoij:
http://wiki.apache.org/nutch/Support

--
* [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
* CNLP  http://www.cnlp.org/tech/lucene.asp
* [http://www.doculibre.com/ Doculibre Inc.] Open source and information 
management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) info at 
doculibre.com
-   * [http://www.dsen.nl DSEN - Java | J2EE | Agile Development  Consultancy]
+   * [http://www.dsen.nl Thomas Delnoij (DSEN) - Java | J2EE | Agile 
Development & Consultancy]
* eventax GmbH info at eventax.com
* [http://www.foofactory.fi/ FooFactory] / Sami Siren info at foofactory 
dot fi
* [http://www.lucene-consulting.com/ Lucene Consulting] - Nutch, Solr, 
Lucene, Hadoop consulting and development.  Founded by Otis Gospodnetic, 
[http://www.amazon.com/Lucene-Action-Otis-Gospodnetic/dp/1932394281 Lucene in 
Action] co-author.


[Nutch Wiki] Update of HelpContents by FuminZHAO

2008-10-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FuminZHAO:
http://wiki.apache.org/nutch/HelpContents

--
+ deleted
- ##language:en
- == Help Contents ==
  
- Here is a tour of the most important help pages:
-  * HelpForBeginners - if you are new to wikis
-  * HelpOnNavigation - explains the navigational elements on a page
-  * HelpOnPageCreation - how to create a new page, and how to use page 
templates
-  * HelpOnUserPreferences - how to make yourself known to the wiki, and adapt 
default behaviour to your taste
-  * HelpOnEditing - how to edit a page
-  * HelpOnActions - tools that work on pages or the whole site
-  * HelpMiscellaneous - more details, and a FAQ section
- 
- These pages contain information only important to wiki administrators and 
developers:
-  * HelpOnAdministration - how to maintain a MoinMoin wiki
-  * HelpOnInstalling - how to install a MoinMoin wiki
-  * HelpForDevelopers - how to add your own features by changing the MoinMoin 
code
- 
- An automatically generated index of all help pages is on HelpIndex. See also 
HelpMiscellaneous/FrequentlyAskedQuestions for answers to frequently asked 
questions.
- 
- If you find any errors on the help pages, describe them on 
MoinMoin:HelpErrata. 
- 
- ''[Please do not add redundant information on these pages (which has to be 
maintained at two places then), and follow the established structure of help 
pages. Also note that the master set of help pages is not public, that this 
very page you read and all other help pages may be overwritten when the wiki 
software is updated. So if you have major contributions that should not get 
lost, send an extra notification notice to the MoinMoin user mailing list.]''
- 


[Nutch Wiki] Update of FindPage by FuminZHAO

2008-10-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FuminZHAO:
http://wiki.apache.org/nutch/FindPage

--
+ deleted
- ##language:en
- You can use this page to search all entries in this WikiWikiWeb.  Searches 
are not case sensitive.
  
- Good starting points to explore a wiki are:
-  * RecentChanges: see where people are currently working
-  * FindPage: search or browse the database in various ways
-  * TitleIndex: a list of all pages in the wiki
-  * WordIndex: a list of all words that are part of page title (thus, a list 
of the concepts in a wiki)
-  * WikiSandBox: feel free to change this page and experiment with editing
- 
- Search '''wiki.apache.org''' using google:
- 
-  [[GoogleSearch]]
- 
- Here's a title search.  Try something like ''manager'':
- 
-  [[TitleSearch]]
- 
- Here's a full-text search.
- 
-  [[FullSearch]]
- 
- You can also use regular expressions, such as
- 
- {{{seriali[sz]e}}}
- 
- Or go direct to a page, or create a new page by entering its name here:
-   [[GoTo]]
- 


[Nutch Wiki] Update of PublicServers by Piratheep Mahenthiran

2008-10-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Piratheep Mahenthiran:
http://wiki.apache.org/nutch/PublicServers

--
* [http://www.tokenizer.org Tokenizer] is an online shopping search engine 
partially powered by Nutch
  
* [http://www.utilitysearch.info/ UtilitySearch] is a search engine for the 
regulated utility industries (Electricity, Water, Gas, and Telecommunications) 
in the United States and Canada.
+   * [http://search.tamilsweb.com/ TamilSWeb Search] is a search engine geared 
toward South Asian web content.
  


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2008-10-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
troubleshooting tips and information to be provided while asking for help

--
  == Introduction ==
- 'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web server as well as proxy 
server.
+ 'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web servers as well as proxy 
servers. This feature cannot do POST-based authentication that depends on 
cookies. More information on this can be found at HttpPostAuthentication.
  
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were re-written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used to configure authentication. The author (Susam Pal) of 
these features has tested it in Infosys Technologies Limited by crawling the 
corporate intranet requiring NTLM authentication and this has been found to 
work well.
+ There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were re-written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used to configure authentication.
  
  == JIRA NUTCH-559 ==
  These features were submitted as 
[https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] in the JIRA. 
If you have checked out the latest Nutch trunk, you don't need to apply the 
patches. These features were included in the Nutch subversion repository in 
[http://svn.apache.org/viewvc?view=revrevision=608972 revision #608972]
@@ -91, +91 @@

  'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one scheme may be used at a time for authenticating, it must 
choose which scheme to use. To accomplish this, it uses an order of preference 
to select the correct authentication scheme. By default this order is: NTLM, 
Digest, Basic. For more information on the behavior during authentication, you 
might want to read the 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html 
HttpClient Authentication Guide].
  
  == Need Help? ==
- If you need help, please feel free to post your question to the 
[http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing 
list].
+ If you need help, please feel free to post your question to the 
[http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing 
list]. The author of this work, Susam Pal, usually responds to mails related to 
authentication problems. The DEBUG logs may be required to troubleshoot the 
problem. You must enable the debug log for 'protocol-httpclient' before running 
the crawler. To enable debug log for 'protocol-httpclient', open 
'conf/log4j.properties' and add the following line:
+ {{{
+ log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
+ }}}
  
+ It would be good to check the following things before asking for help.
+ 
+  1. Have you overridden the 'plugin.includes' property of 
'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced 
'protocol-http' with 'protocol-httpclient'?
+  1. If you patched Nutch 0.9 source code manually with this patch, did you 
build the project before running the crawler?
+  1. Have you configured 'conf/httpclient-auth.xml'?
+  1. Do you see Nutch trying to fetch the pages you were expecting in 
'logs/hadoop.log'? You should see log lines like fetching 
http://www.example.com/expectedpage.html, where the URL is the page you were 
expecting to be fetched. If you don't see such lines for the pages you were 
expecting, the error is outside the scope of this feature. This feature comes 
into action only when the 

[Nutch Wiki] Update of Nutch0.9-Hadoop0.10-Tutorial by MarcinOkraszewski

2008-09-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by MarcinOkraszewski:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

The comment on the change is:
Troubleshooting entry.

--
  
  See [http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces] for more 
info about the number of map reduce tasks.
  
+ == Error when putting a file to DFS ==
+ If you get a similar error:
+ {{{
+ put: java.io.IOException: failed to create file /user/nutch/.test.crc on 
client 127.0.0.1 because target-length is 0, below MIN_REPLICATION (1) 
+ }}}
+ it may mean you do not have enough disk space. It happened to me with 90MB of 
disk space available, on Nutch 0.9/Hadoop 0.12.2. See also this 
[http://www.mail-archive.com/[EMAIL PROTECTED]/msg09701.html mailing list 
message].
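+ 
+ To check how much space is actually available, something like this may help 
(both commands exist in Hadoop of that era; run them from the Hadoop/Nutch 
home directory):
+ {{{
+ df -h                        # local disk space on each node
+ bin/hadoop dfsadmin -report  # DFS capacity and usage per datanode
+ }}}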
+ 


[Nutch Wiki] Update of RunningNutchAndSolr by PieterCoucke

2008-07-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by PieterCoucke:
http://wiki.apache.org/nutch/RunningNutchAndSolr

The comment on the change is:
svn path

--
  
  I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
  
-  1. Check out solr-trunk ( svn co http://svn.apache.org/repos/solr/ 
solr-trunk )
+  1. Check out solr-trunk ( svn co 
http://svn.apache.org/repos/asf/lucene/solr/ solr-trunk )
-  1. Check out nutch-trunk ( svn co http://svn.apache.org/repos/nutch/ 
nutch-trunk )
+  1. Check out nutch-trunk ( svn co 
http://svn.apache.org/repos/asf/lucene/nutch/ nutch-trunk )
   1. Go into the solr-trunk and run 'ant dist dist-solrj'
   1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar 
from solr-trunk/dist to nutch-trunk/lib
   1. Apply patch from 
[http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch FooFactory patch] 
to nutch-trunk (cd nutch-trunk; patch -p0  nutch_solr.patch)


[Nutch Wiki] Trivial Update of RunningNutchAndSolr by PieterCoucke

2008-07-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by PieterCoucke:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
 * Edit the imports to pick up org.apache.hadoop.util.ToolRunner
   1. Edit nutch-trunk/src/java/org/apache/nutch/indexer/Indexer.java changing 
scope on LuceneDocumentWrapper from private to protected
   1. Get the zip file from 
[http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html  
FooFactory] for SOLR-20
-  1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
+  1. Unzip solr-client.zip somewhere, go into java/solrj and run 'ant'
   1. Copy solr-client.jar from dist to nutch-trunk/lib
   1. Copy xpp3-1.1.3.4.0.jar from lib to nutch-trunk/lib
   1. Configure nutch-trunk/conf/nutch-site.xml with *at least* settings for 
your site including a value for the property indexer.solr.url (something like 
http://localhost:8983/solr/), but you should also have http.agent.name, 
http.agent.description, http.agent.url, and http.agent.email as well.
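 
  A minimal nutch-site.xml for this setup might look like the following sketch 
(the values are placeholders):
  {{{
  <configuration>
    <property>
      <name>indexer.solr.url</name>
      <value>http://localhost:8983/solr/</value>
    </property>
    <property>
      <name>http.agent.name</name>
      <value>MyCrawler</value>
    </property>
    <!-- add http.agent.description, http.agent.url and http.agent.email too -->
  </configuration>
  }}}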


[Nutch Wiki] Update of Nutch 0.9 Crawl Script Tutorial by AlessioTomasino

2008-05-18 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by AlessioTomasino:
http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial

--
  Please add comments / corrections to this document. 'cause I don't know what 
the heck I'm doing yet. :)
  One thing I want to figure out, is if I can inject just a subset of urls of 
pages that I know have changed since the last crawl and refetch/index only 
those pages. I think there is a way to do this using the adddays parameter 
maybe? anyone have any insight?
  
+ == How to refetch/index a subset of urls ==
+ 
+ My solution to this common question is to use a filter on the URLs we want to 
refetch and have them expire using the -adddays option of the 'nutch generate' 
command.
+ In nutch-site.xml you should enable a filter plugin such as urlfilter-regex 
and specify the file which contains the regex filter rules:
+ 
+ {{{
+ <property>
+   <name>plugin.includes</name>
+   <value>protocol-http|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed|urlfilter-regex</value>
+ </property>
+ 
+ <property>
+   <name>urlfilter.regex.file</name>
+   <value>regex-urlfilter.txt</value>
+ </property>
+ }}}
+ 
+ The file regex-urlfilter.txt can contain any regular expression, including 
one or more specific URLs we want to refetch/index, e.g.:
+ {{{
+ +http://myhostname/myurl.html
+ }}}
+ 
+ At this stage we can use the command {{{$NUTCH_HOME/bin/nutch generate 
crawl/crawldb crawl/segments -adddays 31}}} to generate a segment. Once the new 
segment is fetched, the fetcher output should look like:
+ {{{
+ Fetcher: starting
+ Fetcher: segment: crawl/segments/20080518090826
+ Fetcher: threads: 50
+ fetching http://myhostname/myurl.html
+ redirectCount=0
+ }}}
+ 
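+ Putting it all together, one refetch cycle for the filtered URLs might look 
like this sketch (the paths follow the 0.9 tutorial layout; the second line 
assumes the newest segment is the one just generated):
+ {{{
+ bin/nutch generate crawl/crawldb crawl/segments -adddays 31
+ s=`ls -d crawl/segments/* | tail -1`
+ bin/nutch fetch $s
+ bin/nutch updatedb crawl/crawldb $s
+ }}}
+ 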
+ Any comments/feedback welcome!
+ 
+ 
+ 


[Nutch Wiki] Update of PublicServers by Finbar Dineen

2008-05-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Finbar Dineen:
http://wiki.apache.org/nutch/PublicServers

--
* [http://www.bigsearch.ca/ Bigsearch.ca] uses nutch open source software 
to deliver its search results.
  
* [http://busytonight.com/ BusyTonight]: Search for any event in the United 
States, by keyword, location, and date. Event listings are automatically 
crawled and updated from original source Web sites.
+ 
+   * [http://www.centralbudapest.com/search Central Budapest Search] is a 
search engine for English language sites focussing on Budapest news, 
restaurants, accommodation, life and events.

* [http://circuitscout.com Circuit Scout] is a search engine for electrical 
circuits.
  


[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic

2008-05-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

The comment on the change is:
This won't work 100% correctly - removing it so I don't mislead people

--
- Without overlapping jobs people running Nutch are likely not utilizing their 
clusters fully.  Thus, here is a recipe for overlapping jobs:
+ deleted
  


[Nutch Wiki] Update of Nutch2Architecture by DennisKubes

2008-04-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/Nutch2Architecture

The comment on the change is:
Changed DI for configuration and reflection utils

--
  == Overview ==
* Reuse of existing Nutch codebase
  * While some things will change this architecture is more of a refactor 
than a complete re-write.  Much of the existing codebase including plugin 
functionality should be reused.
-   * Dependency Injection
- * Remove the plugin framework and use a DI framework, Spring for example, 
to create mapper and reducer classes that are auto injected with dependencies.  
This will take modifications to the Hadoop codebase.
+   * Remove the plugin framework
+ * After some experimenting, DI using Spring or another similar framework 
presents problems.  The good news is that we can achieve the same thing using 
the configuration objects from Hadoop along with creating new instances using 
ReflectionUtils (see the sketch below this list).  This is more service locator 
than dependency injection, but it still gives us the same benefits.
+ * Have the ability to change the job configuration settings for tools.  
This can be accomplished through some type of properties file on the classpath 
and would be useful for testing, for example the ability to switch out an 
output format to see the output in text format.
  * Have mock objects that make it easy to test jobs.
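 
  A minimal sketch of the service-locator idea using Hadoop's ReflectionUtils 
(Configuration and ReflectionUtils are real Hadoop classes; the UrlFilter 
extension type, its default implementation, and the config key are 
hypothetical):
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ReflectionUtils;

  public class ExtensionLocator {
    public static UrlFilter lookupFilter(Configuration conf) {
      // The implementation class is named in the job configuration and
      // instantiated reflectively, service-locator style.
      Class<? extends UrlFilter> clazz =
          conf.getClass("urlfilter.impl", DefaultUrlFilter.class, UrlFilter.class);
      return ReflectionUtils.newInstance(clazz, conf);
    }
  }
  }}}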
  
  == Data Structures ==


[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic

2008-04-23 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

--
  
  2. run a fetcher job (uses M-2 maps and later all R reduces)
  
- 3. at this point there are 2 open map slots for something else to run, say 
the updatedb job for the previously fetched/parsed segment
+ 3. at this point, while the fetch job is still running, there are 2 open map 
slots for something else to run, say the updatedb job for the previously 
fetched/parsed segment
  
  4. when updatedb job is done the cluster can take on more jobs.  Any 
completed tasks (C) from the running fetcher job represent open work slots
  


[Nutch Wiki] Update of GettingNutchRunningWithDebian by StevenHayles

2008-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by StevenHayles:
http://wiki.apache.org/nutch/GettingNutchRunningWithDebian

The comment on the change is:
Added installation of tomcat5.5-webapps; without it the home page is blank

--
   ''export JAVA_HOME''[[BR]]
  
  ==  Install Tomcat5.5 and Verify that it is functioning ==
-  ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin ''[[BR]]
+  ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin 
tomcat5.5-webapps''[[BR]]
  Verify Tomcat is running:[[BR]]
   ''# /etc/init.d/tomcat5.5 status''[[BR]]
   ''#Tomcat servlet engine is running with Java pid 
/var/lib/tomcat5.5/temp/tomcat5.5.pid''[[BR]]


[Nutch Wiki] Update of FetchCycleOverlap by OtisGospodnetic

2008-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

New page:
Without overlapping jobs people running Nutch are likely not utilizing their 
clusters fully.  Thus, here is a recipe for overlapping jobs:

0. imagine a cluster with M max maps and R max reduces (say M=R=8)

1. run generate job with -numFetchers equal to M-2

2. run a fetcher job (uses M-2 maps and later all R reduces)

3. at this point there are 2 open map slots for something else to run, say the 
updatedb job for the previously fetched/parsed segment

4. when updatedb job is done the cluster can take on more jobs.  Any completed 
tasks (C) from the running fetcher job represent open work slots

5. start another fetch job.  This will be able to use only C tasks, but C will 
grow as the first job opens up more slots, eventually hitting M-2 open slots.

6. at some point, the fetch job from 2) above will complete, opening up 2 map 
slots, so updatedb can be run, even in the background, allowing the execution 
to go back to 1)

Because a URL is locked out for 7 days after the generate step included it 
in a fetchlist, the above cycle needs to complete within 7 days.  In more 
detail:

Generate updates the CrawlDb so that urls selected
for the latest fetchlist become locked out for the next 7 days. This
means that you can happily generate multiple fetchlists, fetch them
out of order, and then do the DB updates out of order, as you see fit,
so long as you finish within the 7-day lock-out period.

This means that it's practical to limit the numFetchers to a number
below your cluster capacity, because then you can run other maintenance
jobs in parallel with the currently running fetch job (such as updatedb
and generate of next fetchlists).
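
As a concrete sketch for the M=R=8 example above (these are the standard Nutch 
commands; the segment paths are placeholders, and the fetch runs in the 
background so the free map slots can be used):
{{{
bin/nutch generate crawl/crawldb crawl/segments -numFetchers 6
bin/nutch fetch crawl/segments/NEWEST_SEGMENT &
bin/nutch updatedb crawl/crawldb crawl/segments/PREVIOUS_SEGMENT
}}}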

