[Nutch Wiki] Trivial Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr The comment on the change is: Changed fields for copyField line to correct values -- * Add the fields that Nutch needs (url, content, segment, digest, host, site, anchor, title, tstamp, text--see [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory Article on Nutch + Solr]) * Change defaultSearchField to 'text' * Change defaultOperator to 'AND' -* Add lines to copyField section to copy cat name into the text field +* Add lines to copyField section to copy anchor, title, and content into the text field 1. Start the Solr you just made (cd /tmp/mysolr; java -jar start.jar) 1. Run a Nutch crawl using the bin/crawl.sh script.
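For reference, the copyField change described in the entry above would look roughly like this in the Solr schema.xml. This is a sketch based only on the field names listed above (anchor, title, content, text); the rest of the schema is not shown in this digest:

```xml
<!-- copy the Nutch fields into the catch-all 'text' field used for searching -->
<copyField source="anchor" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>
```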
[Nutch Wiki] Update of Nutch2Architecture by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/Nutch2Architecture New page: = Nutch 2.0 Architecture = The purpose of this page is to discuss ideas for the architecture of the next generation of Nutch.
[Nutch Wiki] Update of RunningNutchAndSolr by uygar bayar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by uygar bayar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93) --- + Sorry, but nothing changed!! Same as below.. + ERROR I changed the lines and it worked. But this time it gave this error. I tried both private and protected scopes but nothing changed. I also replaced the line Document doc = (Document) ((ObjectWritable) value).get(); with Document doc = (Document) ((NutchWritable) value).get(); this time it gave a build error..
[Nutch Wiki] Update of RunningNutchAndSolr by uygar bayar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by uygar bayar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- '''Troubleshooting:''' * If you get errors about Type mismatch in value from map: (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer. --- + ERROR I did everything but I got this error, any idea?? 2008-04-03 15:42:28,009 WARN mapred.LocalJobRunner - job_local_1 @@ -46, +47 @@ at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:111) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93) + --- + ERROR + I changed the lines and it worked. But this time it gave this error. I tried both private and protected scopes but nothing changed. + I also replaced the line Document doc = (Document) ((ObjectWritable) value).get(); with Document doc = (Document) ((NutchWritable) value).get(); this time it gave a build error.. + 2008-04-04 10:41:48,490 WARN mapred.LocalJobRunner - job_local_1 + + java.lang.ClassCastException: org.apache.nutch.indexer.Indexer$LuceneDocumentWrapper + at org.apache.nutch.indexer.SolrIndexer$OutputFormat$1.write(SolrIndexer.java:135) + at org.apache.hadoop.mapred.ReduceTask$2.collect(ReduceTask.java:315) + at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:275) + at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) + at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) + at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) + 2008-04-04 10:41:49,085 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed! 
+ at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894) + at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:87) + at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:112) + at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) + at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:94) +
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr The comment on the change is: Corrected line 3 of instructions (should have been nutch-trunk) -- 1. Check out solr-trunk and nutch-trunk 1. Go into the solr-trunk and run 'ant dist dist-solrj' - 1. Get zip from [http://variogram.com/latest/SolrIndexer.zip Variogr.am] and unzip it to solr-trunk + 1. Get zip from [http://variogram.com/latest/SolrIndexer.zip Variogr.am] and unzip it to nutch-trunk. 1. Copy apache-solr-solrj-1.3-dev.jar and apache-solr-common-1.3-dev.jar to nutch-trunk/lib 1. Get the zip file from [http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html FooFactory] for SOLR-20 1. Unzip solr-client.zip somewhere, go into java/solr/src and run 'ant'
[Nutch Wiki] Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr -- '''Troubleshooting:''' * If you get errors about Type mismatch in value from map: (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer. + * Note that I've changed the steps above. I was mistaken; you don't need the zip from Variogr.am. You only need the files from FooFactory. If you take that part out then you shouldn't see errors about ClassCastExceptions any more. --- ERROR I did everything but I got this error, any idea??
[Nutch Wiki] Update of RunningNutchAndSolr by uygar bayar
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by uygar bayar: http://wiki.apache.org/nutch/RunningNutchAndSolr -- If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents. If not, then you've got something not configured right. I'll try to add more notes here as people have questions/issues. + --- + I did everything but i got this error any idea?? + + 2008-04-03 15:42:28,009 WARN mapred.LocalJobRunner - job_local_1 + java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.ObjectWritable, recieved org.apache.nutch.crawl.NutchWritable + at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:369) + at org.apache.nutch.indexer.Indexer.map(Indexer.java:344) + at org.apache.nutch.indexer.Indexer.map(Indexer.java:52) + at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) + at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208) + at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) + 2008-04-03 15:42:28,609 FATAL indexer.Indexer - SolrIndexer: java.io.IOException: Job failed! + at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894) + at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:86) + at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:111) + at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) + at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:93) +
[Nutch Wiki] Trivial Update of RunningNutchAndSolr by NickTkach
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by NickTkach: http://wiki.apache.org/nutch/RunningNutchAndSolr -- If you watch the output from your Solr instance (logs) you should see a bunch of messages scroll by when Nutch finishes crawling and posts new documents. If not, then you've got something not configured right. I'll try to add more notes here as people have questions/issues. + '''Troubleshooting:''' + * If you get errors about Type mismatch in value from map: (expected ObjectWritable, but received NutchWritable), then you likely are missing the two steps I just added in step 9 above. Sorry about that, I forgot about making the change there in SolrIndexer. --- I did everything but i got this error any idea??
[Nutch Wiki] Update of FrontPage by IOrtega
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by IOrtega: http://wiki.apache.org/nutch/FrontPage -- * [IndexStructure] * [Getting Started] * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application - + * InstallingWeb2 == Other Resources == * [http://nutch.sourceforge.net/blog/cutting.html Doug's Weblog] -- He's the one who originally wrote Lucene and Nutch. * [http://wiki.media-style.com/display/nutchDocu/Home Stefan's Nutch Documentation]
[Nutch Wiki] Update of InstallingWeb2 by IOrtega
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by IOrtega: http://wiki.apache.org/nutch/InstallingWeb2 New page: chris sleeman wrote: Hi, Can anyone tell me how to use the spell-check query plugin available in the contrib/web2 dir (and the rest of the plugins too)? Is it similar to enabling the nutch-plugins? Following these steps should get you there: 1. compile nutch (in the top-level dir, run ant) 2. crawl your data (see tutorial) 3. edit your conf/nutch-site.xml so it contains the plugins web-query-propose-spellcheck and webui-extensionpoints 4. edit conf/nutch-site.xml so it contains the proper dir for plugins, as the plugins are not packaged inside the .war (something like <property><name>plugin.folders</name><value>path to plugins dir</value></property>) 5. compile the web2 plugins (in contrib/web2 run ant compile-plugins) 6. edit search.jsp so it contains the line <tiles:insert definition="propose" ignore="true"/> just before the second c:choose 7. create the web2 app (in contrib/web2 run ant war) 8. build your spell-check index (bin/nutch plugin web-query-propose-spellcheck org.apache.nutch.spell.NGramSpeller -i indexdir -f content -o spelling) 9. deploy the webapp to tomcat 10. start tomcat (from the dir where you have your crawl data and the ngram index generated in #8) 11. search for something that is spelled incorrectly Also how do we build the spelling index? Are these plugins still WIP? I see #8 above; the whole web2 is MWSN (More Work Still Needed:) I haven't been able to find any docs on these. That's because there currently is no documentation other than the readme at http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/README.txt?view=markup I should probably put some documentation on the wiki to gain more traction. fyi - I just committed a small fix to a bug that might prevent the spell-checking proposer from working. So if you have problems, check out the trunk or a nightly build tomorrow. -- Sami Siren
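The plugin.folders property mentioned in step 4 above would look roughly like this in conf/nutch-site.xml. The path below is a placeholder; point it at wherever the compiled web2 plugins actually live on your machine:

```xml
<property>
  <name>plugin.folders</name>
  <!-- placeholder path: the dir containing the compiled web2 plugins -->
  <value>/path/to/nutch/contrib/web2/plugins</value>
</property>
```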
[Nutch Wiki] Update of FAQ by MarkDeSpain
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MarkDeSpain: http://wiki.apache.org/nutch/FAQ The comment on the change is: changed answer for authentication question from Unkown to existing Wiki link -- How can I fetch pages that require Authentication? - Unknown. + See HttpAuthenticationSchemes. === Updating ===
[Nutch Wiki] Update of FrontPage by DanFrost
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanFrost: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Sorry - our server was moved last night so the site was down. Pls view our site. -- * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). + * [http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449
[Nutch Wiki] Trivial Update of FrontPage by DanFrost
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanFrost: http://wiki.apache.org/nutch/FrontPage The comment on the change is: How complicated can it be? I'm so dumb... put 2 x http:// in there -- * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). - * [http://http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449 + * [http://www.interadvertising.co.uk/blog/nutch_logos Larger / better quality Nutch logos] Re-created Nutch logos available in GIF, PNG EPS in resolutions up to 1200 x 449
[Nutch Wiki] Update of PublicServers by PeterRaines
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PeterRaines: http://wiki.apache.org/nutch/PublicServers -- * [http://www.jcintersonic.com/ JC Intersonic] uses nutch as its search engine. + * [http://www.jumblefox.com.au/ Jumble Fox] - The Australian Search Engine + * [http://krugle.com Krugle] uses Nutch to crawl web pages for code, archives and technically-interesting content. We also use a modified version of Nutch to crawl CVS/Subversion repositories, and the NutchBean/distributed searcher support to search and generate hits for code and tech info queries. * [http://www.labforculture.org LabforCulture] - The essential tool for everyone in arts and culture who creates, collaborates, shares and produces across borders in Europe.
[Nutch Wiki] Update of WritingPluginExample by JasperKamperman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JasperKamperman: http://wiki.apache.org/nutch/WritingPluginExample -- - ''This was written for the 0.7 branch. For an example using the 0.8 code, see [wiki:WritingPluginExample-0.8 this page]'' + ''This was written for the 0.7 branch. For an example using the 0.8 code, see [wiki:WritingPluginExample-0.8 this page]''. For an example using the 0.9 code, see [wiki:WritingPluginExample-0.9 this page]. == The Example ==
[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: fixed typo -- However, an important thing to note here is that if some page of example:8080 requires authentication in another realm, say, 'mail', authentication would not be done even though the second set of credentials is defined as default. Of course this doesn't affect authentication for other web servers and the default authscope would be used for other web-servers. This problem occurs only for those web-servers which have authentication scopes defined for a few selected realms/schemes. This is discussed in next section. === Catch-all Authentication Scope for a Web Server === - When one or more authentication scopes are defined for a particular web server (host:port), then the default credentials is ignored for that host:port combination. Therefore, an catch-all authentication scope to handle all other realms and scopes must be specified explicitly as shown below. + When one or more authentication scopes are defined for a particular web server (host:port), then the default credentials is ignored for that host:port combination. Therefore, a catch-all authentication scope to handle all other realms and scopes must be specified explicitly as shown below. {{{ credentials username=susam password=masus
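The configuration fragment quoted above is cut off in this digest. Pieced together from the surrounding description (credentials tag with authscope children), a realm-specific scope plus a catch-all scope for the same web server might look like the sketch below; the host, port, and realm values are placeholders, not values from the original page:

```xml
<credentials username="susam" password="masus">
  <!-- scope for one specific realm on this host:port -->
  <authscope host="example" port="8080" realm="sso"/>
  <!-- catch-all scope: no realm/scheme, so it handles all other
       realms and schemes on this host:port -->
  <authscope host="example" port="8080"/>
</credentials>
```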
[Nutch Wiki] Update of NutchTutorial by MarioMendez
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MarioMendez: http://wiki.apache.org/nutch/NutchTutorial -- {{{ http://lucene.apache.org/nutch/ }}} - * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read: + * Edit the file conf/crawl-urlfilter.txt (it worked for me when I used the file conf/regex-urlfilter.txt instead) and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read: {{{ +^http://([a-z0-9]*\.)*apache.org/ }}}
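A quick way to sanity-check the filter line from the step above is to try the same pattern with grep -E, whose extended-regex syntax behaves the same as the Java regex for this simple pattern. The URLs below are just examples:

```shell
# Pattern copied from the tutorial step above (the unescaped dot in
# "apache.org" matches any character, which is harmless here).
filter='^http://([a-z0-9]*\.)*apache.org/'

# A URL inside the apache.org domain is accepted...
echo "http://lucene.apache.org/nutch/" | grep -Eq "$filter" && echo "accepted"
# ...while one outside it is not.
echo "http://www.example.com/" | grep -Eq "$filter" || echo "rejected"
```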
[Nutch Wiki] Trivial Update of GettingNutchRunningOnCygwin by MatMcGowan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MatMcGowan: http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin -- A further problem is the path setting for the sshd service. After ssh'ing in, the Windows hostname is still used (although on my installation $PATH appeared correct, e.g. {{{ssh localhost echo $PATH; type hostname}}} showed a correct path with {{{/usr/bin:/bin}}} in the path before the Windows directories, yet type was finding the Windows version of the file.) - To fix the path setting, add the PATH environment variable to the sshd service. Under the key {{{HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment}}} create a new string value PATH, set to {{{/usr/bin:/usr/lib:/bin:$PATH}}}. After restarting the sshd service, return the tests above, and the {{{^M}}} should no longer be present. + To fix the path setting, add the PATH environment variable to the sshd service. Under the key {{{HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment}}} create a new string value PATH, set to {{{/usr/bin:/usr/lib:/bin:%PATH%}}}. This prepends /usr/bin etc. to the PATH defined in Windows. After restarting the sshd service, rerun the tests above, and the {{{^M}}} should no longer be present. See also http://www.cygwin.com/ml/cygwin/2007-07/msg00045.html
[Nutch Wiki] Update of GettingNutchRunningOnCygwin by MatMcGowan
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by MatMcGowan: http://wiki.apache.org/nutch/GettingNutchRunningOnCygwin New page: = Problems and workarounds for running nutch on cygwin = I followed the NutchHadoopTutorial and encountered a few problems which, after looking for solutions, seem to have affected others trying to run nutch on cygwin. == Line Endings == The tutorial mentions using dos2unix for all commands, i.e. {{{ dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch dos2unix /nutch/search/conf/*.sh }}} But this also applies to the slaves (and master) files, i.e. {{{ dos2unix /nutch/search/conf/slaves }}} Without this, the hostname passed to ssh will include a trailing '\r', producing a "no address associated with name" error. == Logging files == The hostname command is used to construct logfile names. This command is included in Windows and in cygwin. When the Windows version is used, an additional '\r' is included in the command output, causing the logfile name to be an invalid filename. Errors such as "Head: cannot open filename for reading: no such file or directory" occur, even though the name looks ok. You can see the problem first hand by running {{{ ssh localhost hostname | cat -v }}} The output will include {{{^M}}} after the hostname. The first cause of the problem is that hostname is not installed under cygwin. To get this, install coreutils from the base category. A further problem is the path setting for the sshd service. After ssh'ing in, the Windows hostname is still used (although on my installation $PATH appeared correct, e.g. {{{ssh localhost echo $PATH; type hostname}}} showed a correct path with {{{/usr/bin:/bin}}} in the path before the Windows directories, yet type was finding the Windows version of the file.) To fix the path setting, add the PATH environment variable to the sshd service. 
Under the key {{{HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment}}} create a new string value PATH, set to {{{/usr/bin:/usr/lib:/bin:$PATH}}}. After restarting the sshd service, rerun the tests above, and the {{{^M}}} should no longer be present. See also http://www.cygwin.com/ml/cygwin/2007-07/msg00045.html == Impersonated SSH Account == When using the cygwin sshd, it is necessary to first ssh in before running NDFS commands (e.g. bin/hadoop dfs -put urls urls.) This is to ensure the current user account is consistent with later ssh sessions. (Even if you ssh in as the same user you are running locally, the sshd service may use a different user account.) With my setup, I had a nutch shortcut to cygwin.bat that was started using runas.exe, to launch the nutch user. NDFS commands would then write files to /user/nutch. But, after running ssh, supposedly logging in as the same user, NDFS stores files under /user/sshd, because the current user account was in fact the sshd account. On Windows 2003 Server, cygwin sshd is not able to log in users under their actual account. If you ssh in and enter the command {{{ %SystemRoot%\System32\whoami.exe }}} it will display the account name running the sshd service, and not the user you expected (e.g. nutch.) If you simply type {{{ whoami }}} then it will print nutch or whichever user you ssh'ed in as. When HDFS runs, it is the native username that it sees, i.e. the account running the sshd service. Logging in first with ssh before doing anything with NDFS ensures that all files are created using the same account name in the HDFS hierarchy (the sshd service account.) Not doing this, the first files created by local commands (e.g. hadoop dfs -put urls) will go to the nutch user (or the user running the cygwin shell) while subsequent commands run on remote machines will go under the sshd account folder in HDFS. As a result of this, the sshd account also needs write access to the nutch folder.
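Condensed, the registry change described above amounts to the single value below, set with regedit. Note the follow-up fix earlier in this digest: the data should end in %PATH% rather than $PATH, so that the existing Windows PATH is appended:

```
Key:   HKLM\System\CurrentControlSet\Services\sshd\Parameters\Environment
Value: PATH  (REG_SZ)
Data:  /usr/bin:/usr/lib:/bin:%PATH%
```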
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: role of http.agent.host in NTLM and patch committed -- == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, HTTPS, or the NTLM, Basic, and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems and to provide additional features such as authentication support for proxy servers and better inline documentation for the properties used to configure authentication. The author (Susam Pal) of these features has tested them at Infosys Technologies Limited by crawling a corporate intranet requiring NTLM authentication, and this has been found to work well. - == Download == - Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is named as [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch]. + == JIRA NUTCH-559 == + These features were submitted as [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559]. If you have checked out the latest Nutch trunk, you don't need to apply the patches. These features were included in the Nutch subversion repository in [http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972]. == Introduction to Authentication Scope == Different credentials for different authentication scopes can be configured in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a particular authentication scope (i.e. 
particular host, port number, realm and/or scheme), then that set of credentials would be sent only to pages falling under the specified authentication scope. @@ -82, +82 @@ 1. For the authscope tag, the 'host' and 'port' attributes should always be specified. The 'realm' and 'scheme' attributes may or may not be specified depending on your needs. If you are tempted to omit the 'host' and 'port' attributes because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That's what the 'default' tag is meant for. 1. One authentication scope should not be defined twice as different authscope tags for different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for authentication scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as it might change with further development. 1. Do not define multiple authscope tags with the same host and port but different realms if the server requires NTLM authentication. This means there should not be multiple tags with the same host, port, and scheme=NTLM but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with the same host and port but different realms. This is discussed more in the next section. + 1. If you are using the NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml === A note on NTLM domains === - NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. 
There should be exactly one authscope tag for NTLM scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute. + NTLM does not use the concept of realms. Therefore, multiple realms for a web-server can not be defined as different authentication scopes for the same web-server requiring NTLM authentication. There should be exactly one authscope tag for NTLM scheme authentication scope for a particular web-server. The authentication domain should be specified as the value of the 'realm' attribute. NTLM authentication also requires the name or IP address of the host on which the crawler is running. Thus, 'http.agent.host' should be set properly. == Underlying HttpClient Library == 'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for
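Putting the NTLM notes above together, a configuration sketch might look like the fragments below. The host, domain, and credentials are placeholders; the element and attribute names follow the description in the entry above (one authscope per server, NTLM domain in 'realm', plus http.agent.host in nutch-site.xml):

```xml
<!-- conf/httpclient-auth.xml: exactly one authscope tag per web server
     for NTLM, with the NTLM domain given as the 'realm' attribute -->
<credentials username="user" password="pass">
  <authscope host="intranet" port="80" realm="MYDOMAIN" scheme="NTLM"/>
</credentials>

<!-- conf/nutch-site.xml: name or IP address of the host running the crawler -->
<property>
  <name>http.agent.host</name>
  <value>crawler.example.com</value>
</property>
```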
[Nutch Wiki] Update of FrontPage by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Added Upgrading Hadoop in Nutch -- * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts) * [:NutchHadoopTutorial:Nutch Hadoop Tutorial] - How to setup Nutch and Hadoop over a cluster of machines * [:Automating_Fetches_with_Python:Automating Fetches with Python] - How to automate the Nutch fetching process using Python + * [:Upgrading_Hadoop:Upgrading Hadoop Version in Nutch] - Basic steps for upgrading Hadoop in Nutch. * [FAQ] * [:CommandLineOptions:Commandline] options for 0.7.x * [:08CommandLineOptions:Commandline] options for version 0.8
[Nutch Wiki] Update of Upgrading Hadoop by DennisKubes
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DennisKubes: http://wiki.apache.org/nutch/Upgrading_Hadoop The comment on the change is: Added initial document New page: The purpose of this document is to show how to upgrade the version of Hadoop within Nutch. * Download the latest release version of Hadoop. It is preferred to download a release version instead of building from source because the release contains the Hadoop binaries and is built by Hadoop QA. If you build from source you will need to build the C libs. Also remember that the name used to build Hadoop will appear in the Hadoop admin screens. If you are upgrading Hadoop for the Nutch release, it is preferred to download the latest binary release of Hadoop. * Unzip the release and copy the lib/native/* directories into your clean Nutch trunk workspace under trunk/lib/native, where trunk is the root of the Nutch trunk. You will also want to copy the hadoop-core jar from the root of the Hadoop release into the trunk/lib directory. * Remove the *.la files from the trunk/lib/native/OS directories (ex. trunk/lib/native/Linux-i386-32/libhadoop.la). These are just script files and are not needed for the release. You will also want to remove any older versions of the hadoop-core jar from the trunk/lib directory. * If there are any errors or code that needs to be changed because of Hadoop API upgrades, that would need to happen here. * Do a full clean and build of Nutch through the ant clean and package targets. * Run the full test suite for Nutch using the ant test target. * It is best to run a few full fetches and indexes using the new Hadoop version. If this is not possible, see if you can build a drop and allow others to run some fetches. It is best to do this using Nutch in a distributed mode. * Once all tests have passed and a few fetch cycles have been run, post a patch with the relevant changes. 
Then, following the standard commit rules for wait time before commit, you can commit to the Nutch repository. Make sure to change the trunk/CHANGES.txt file to reflect the Hadoop upgrade and any significant Hadoop API changes that may have occurred.
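The copy-and-clean steps above can be sketched as the shell commands below. The paths and the Hadoop version are placeholders, and a throwaway mock layout is created first so the commands can be tried safely; in a real upgrade you would point the variables at the unpacked Hadoop release and your Nutch trunk checkout:

```shell
# Placeholder locations (mocked here so the commands are safe to run as-is)
HADOOP_RELEASE=/tmp/mock-hadoop-release
NUTCH_TRUNK=/tmp/mock-nutch-trunk
rm -rf "$HADOOP_RELEASE" "$NUTCH_TRUNK"
mkdir -p "$HADOOP_RELEASE/lib/native/Linux-i386-32" "$NUTCH_TRUNK/lib/native"
touch "$HADOOP_RELEASE/hadoop-0.15.3-core.jar" \
      "$HADOOP_RELEASE/lib/native/Linux-i386-32/libhadoop.so" \
      "$HADOOP_RELEASE/lib/native/Linux-i386-32/libhadoop.la"

# Copy the native lib directories and the hadoop-core jar into the trunk
cp -r "$HADOOP_RELEASE"/lib/native/* "$NUTCH_TRUNK/lib/native/"
cp "$HADOOP_RELEASE"/hadoop-*-core.jar "$NUTCH_TRUNK/lib/"

# Drop the libtool *.la script files; they are not needed for the release
rm -f "$NUTCH_TRUNK"/lib/native/*/*.la
```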
[Nutch Wiki] Update of PublicServers by FuadEfendi
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by FuadEfendi: http://wiki.apache.org/nutch/PublicServers The comment on the change is: I used source code: HTTP protocol, URL filtering normalization, parser, robots -- * [http://www.synoo.com:8080 Synoo.com] is a small web search engine + * [http://www.tokenizer.org Tokenizer] is an online shopping search engine partially powered by Nutch + * [http://www.utilitysearch.info/ UtilitySearch] is a search engine for the regulated utility industries (Electricity, Water, Gas, and Telecommunications) in the United States and Canada.
[Nutch Wiki] Update of JavaApplication by ChazHickman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ChazHickman: http://wiki.apache.org/nutch/JavaApplication -- = Integrating Nutch search functionality into a Java application = + + This example is the fruit of much searching of the nutch users mailing list in order to get a working application that used the Nutch APIs. I couldn't find all that was needed to provide a quick-start in one place, so this document was born... Using Nutch within an application is actually very simple; the requirements are merely the existence of a previously created crawl index, a couple of settings in a configuration file, and a handful of jars in your classpath. Nothing else is needed from the Nutch release that you can download. @@ -75, +77 @@ } }}} + Chaz Hickman (Jan 2008) +
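As a minimal sketch of the kind of code this page describes, here is the legacy Nutch 0.9 searcher API in use. It assumes the Nutch and Hadoop jars are on the classpath and that searcher.dir in your configuration points at the crawl directory; the query string and hit count are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    // Reads nutch-default.xml / nutch-site.xml; searcher.dir must
    // point at the previously created crawl directory.
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);

    Query query = Query.parse("apache", conf);  // query string is illustrative
    Hits hits = bean.search(query, 10);         // top 10 hits

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("url") + " : "
          + details.getValue("title"));
    }
  }
}
```

This matches the 0.8/0.9-era searcher API referenced by this wiki page; later Nutch versions replaced it.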
[Nutch Wiki] Update of AboutPlugins by ViksitGaur
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ViksitGaur: http://wiki.apache.org/nutch/AboutPlugins The comment on the change is: Added a short note on pluginRepository -- In order to get Nutch to use a given plugin, you need to edit your conf/nutch-site.xml file and add the name of the plugin to the list of plugin.includes. + == Using a Plugin From The Command Line == + + Nutch ships with a number of plugins that include a main() method, and sample code to illustrate their use. These plugins can be used from the command line - a good way to start exploring the internal workings of each plugin. + + To do so, you need to use the bin/nutch script from the $NUTCH_HOME directory, + + $ bin/nutch plugin + Usage: PluginRepository pluginId className [arg1 arg2 ...] + + As an example, if you wanted to execute the parse-html plugin, + + $ bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser filename.html + + Here, pluginId is the name of the plugin itself, and className is the fully qualified name of the plugin class to run. + + See also: WritingPluginExample See also: HowToContribute
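For reference, the plugin.includes override mentioned above is an ordinary property block in conf/nutch-site.xml. The value below is only illustrative (it mirrors examples elsewhere on this wiki); your own list must name every plugin your crawl needs.

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```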
[Nutch Wiki] Update of Support by fl
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by fl: http://wiki.apache.org/nutch/Support -- * Stefan Groschupf sg at media-style.com * Michael Nebel mn at nebel.de (germany preferred) * Objects Search + * [http://www.ingate.de INGATE GmbH] * [http://www.intrafind.de IntraFind Software AG] * Michael Rosset mrosset at btmeta.com * Supreet Sethi supreet at linux-delhi.org (india preferred)
[Nutch Wiki] Update of luca facoetti by luca facoetti
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by luca facoetti: http://wiki.apache.org/nutch/luca_facoetti New page: #format wiki #language fr == Help page template == Text. === Example === {{{ xxx }}} === Display === xxx
[Nutch Wiki] Update of NonDefaultIntranetCrawlingOptions by JasonKull
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JasonKull: http://wiki.apache.org/nutch/NonDefaultIntranetCrawlingOptions New page: ##language:en == Options for intranet crawling that are not enabled by default == Here are some options, not enabled by default, that you might want to add to your conf/nutch-site.xml configuration file if you plan on crawling your local network intranet. === Enable additional parser plugins === {{{
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
}}} This will enable the parser plugins for text, html, javascript, pdf, excel, powerpoint, word, rss and zip. There are additional parsers you can enable, which are listed in conf/parse-plugins.xml. If you have additional document types you wish to parse and they are listed in the parse-plugins file, just add them to the list. === Increase the file size fetch limit === {{{
<property>
  <name>http.content.limit</name>
  <value>2097152</value>
</property>
}}} This will increase the default file size fetching limit to 2 megabytes. If your documents are larger (such as PDFs) then increase the number appropriately.
[Nutch Wiki] Trivial Update of RunNutchInEclipse0.9 by JakeVanderdray
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JakeVanderdray: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- === Create a new java project in Eclipse === * File New Project Java project click Next + * Name the project (Nutch_Trunk for instance) - * select Create project from existing source and use the location where you downloaded Nutch + * Select Create project from existing source and use the location where you downloaded Nutch - * click on Next, and wait while Eclipse is scanning the folders + * Click on Next, and wait while Eclipse is scanning the folders - * add the folder conf to the classpath (third tab and then add class folder) + * Add the folder conf to the classpath (third tab and then add class folder) * Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. Also add all jars in lib and in the plugin lib folders to your libraries - * set output dir to tmp_build, create it if necessary + * Set output dir to tmp_build, create it if necessary * DO NOT add build to classpath
[Nutch Wiki] Update of Becoming A Nutch Developer by JakeVanderdray
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JakeVanderdray: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer The comment on the change is: Small fixes (mostly spelling) -- Me }}} - With this type of question other users would have no idea what the problem is or how to help and therefore most simply ignore the question and move on. Other the other hand here is better example of asking questions. + With this type of question other users would have no idea what the problem is or how to help and therefore most simply ignore the question and move on. On the other hand, here is a better example of asking questions. __A Good Email__ {{{ @@ -97, +97 @@ * [http://www.mail-archive.com/index.php?hunt=nutch Nutch Mail Archive] * [http://www.nabble.com/forum/Search.jtp?query=nutch Nabble Nutch] - When searching the list for errors you have recieved it is good to search both by component, for example fetcher, and by the actual error recieved. If you are not finding the answers you are looking for on the list, you may want to move to the JIRA and search there for answers. + When searching the list for errors you have received it is good to search both by component, for example fetcher, and by the actual error received. If you are not finding the answers you are looking for on the list, you may want to move to the JIRA and search there for answers. - Here are some other important things to remember about the mailing lists. First, do not cross post questions. Find the best list for you question and post your it to that list only. Posting the same question to multiple lists (i.e. user and dev) tends to annoy the very people you are wanting to help you. Second, remember that developers and committers have day jobs and deadlines also and that being rude, offensive, or aggressive is a sure way to get your posting ignored if not flamed.
+ Here are some other important things to remember about the mailing lists. First, do not cross post questions. Find the best list for your question and post it to that list only. Posting the same question to multiple lists (i.e. user and dev) tends to annoy the very people you want help from. Second, remember that developers and committers have day jobs and deadlines also and that being rude, offensive, or aggressive is a sure way to get your posting ignored if not flamed. Most questions on the lists are answered within a day. If you ask a question and it is not answered for a couple of days, do not repost the same question. Instead, you may need to reword your question, provide more information, or give a better description in the subject. Step Two: Learning the Nutch Source Code I have found that when teaching new developers the basics of the Nutch source code it is easiest to first start with learning the operations of a full crawl from start to finish. - A word about Hadoop. As soon as you start looking into Nutch code (versions .8 or higher) you will be looking at code that uses and extends Hadoop APIs. Learning the Hadoop source code base is as big an endeavour as learning the Nutch codebase, but because of how much Nutch relies on Hadoop, anyone serious about Nutch develop will also need to learn the Hadoop codebase. + A word about Hadoop. As soon as you start looking into Nutch code (versions .8 or higher) you will be looking at code that uses and extends Hadoop APIs. Learning the Hadoop source code base is as big an endeavor as learning the Nutch codebase, but because of how much Nutch relies on Hadoop, anyone serious about Nutch development will also need to learn the Hadoop codebase. First, start by getting Nutch up and running and completing a full process of fetching through indexing. There are tutorials on this wiki that show how to do this. Second, get Nutch set up to run in an integrated development environment such as Eclipse.
There are also tutorials that show how to accomplish this. Once this is done you should be able to run individual Nutch components inside of a debugger. This is essential because probably the fastest way to learn the Nutch codebase is to step through different components in a debugger. @@ -177, +177 @@ Then please be patient and in the meantime start working on another issue. Committers are busy people too. If no one responds to your patch after a few days, please make friendly reminders to the dev mailing list. Please incorporate others' suggestions into your patch if you think they're reasonable. - Now here is the hard part. Even if you have completed your patch it may not make it into the final Nutch codebase. This could be for any number of reasons, but most often it is because the piece of functionality is not in line with
[Nutch Wiki] Update of NutchHadoopTutorial by WillPugh
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by WillPugh: http://wiki.apache.org/nutch/NutchHadoopTutorial -- * When you first start up hadoop, there's a warning in the namenode log, dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist - You can ignore that. * If you get errors like, failed to create file [...] on client [foo] because target-length is 0, below MIN_REPLICATION (1) this means a block could not be distributed. Most likely there is no datanode running, or the datanode has some severe problem (like the lock problem mentioned above). + + + * This tutorial worked well for me; however, I ran into a problem where my crawl wasn't working. It turned out it was because I needed to set the user agent and other properties for the crawl. If anyone is reading this and running into the same problem, look at the updated tutorial http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29 +
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: root element omitted - restructured the sentence -- When authentication is required to fetch a resource from a web-server, the authentication-scope is determined from the host and port obtained from the URL of the page. If it matches any 'authscope' in this configuration file, then the 'credentials' for that 'authscope' are used for authentication. == Configuration == - Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. The root element is auth-configuration for all the examples below which has been omitted for the sake of clarity. + Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' are very brief, this section explains it in a little more detail. In all the examples below, the root element auth-configuration has been omitted for the sake of clarity. === Crawling an Intranet with Default Authentication Scope === Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet; then a configuration as described below is enough. This is also the simplest configuration possible for authentication schemes.
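Putting the pieces of this section together, here is a sketch of a scope-specific entry in conf/httpclient-auth.xml. The host, port and realm values are made up for illustration, and the auth-configuration root element is omitted here just as in the page's other examples.

```xml
<credentials username="susam" password="masus">
  <authscope host="192.168.101.33" port="80" realm="intranet"/>
</credentials>
```

Requests to that host/port/realm would use these credentials; any other authentication scope falls through to whatever other credentials (or default scope) are configured.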
[Nutch Wiki] Update of FAQ by DanielNaber
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanielNaber: http://wiki.apache.org/nutch/FAQ The comment on the change is: mention OPICScoringFilter -- You can tweak your conf/common-terms.utf8 file after creating an index through the following command: bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index - What ranking algorithm is used in searches ? Does Nutch use the [http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm] ? - - N/A yet - How is scoring done in Nutch? (Or, explain the explain page?) - Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. The formula Lucene uses scoring can be found at the head of the Lucene Similarity class in the [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Lucene Similarity Javadoc]. Roughly, the score for a particular document in a set of query results, score(q,d), is the sum of the score for each term of a query (t in q). A terms score in a document is itself the sum of the term run against each field that comprises a document (title is one field, url another. A document is a set of fields). Per field, the score is the product of the following factors: Its td (term freqency in the document), a score factor idf (usually a factor made up of frequency of term relative to amount of docs in index), an index-time boost, a normalization of count of terms found relative to size of document (lengthNorm), a similar normalization is done for the term in the query i tself (queryNorm), and finally, a factor with a weight for how many instances of the total amount of terms a particular document contains. Study the lucene javadoc to get more detail on each of the equation components and how they effect overall score. + Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does it. 
The formula Lucene uses for scoring can be found at the head of the Lucene Similarity class in the [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Lucene Similarity Javadoc]. Roughly, the score for a particular document in a set of query results, score(q,d), is the sum of the score for each term of a query (t in q). A term's score in a document is itself the sum of the term run against each field that comprises a document (title is one field, url another; a document is a set of fields). Per field, the score is the product of the following factors: its tf (term frequency in the document), a score factor idf (usually a factor made up of the frequency of the term relative to the number of docs in the index), an index-time boost, a normalization of the count of terms found relative to the size of the document (lengthNorm), a similar normalization done for the term in the query itself (queryNorm), and finally a factor weighting how many of the query's terms a particular document contains (coord). Study the Lucene javadoc to get more detail on each of the equation components and how they affect the overall score. Interpreting the Nutch explain.jsp, you need to keep the above-cited Lucene scoring equation in mind. First, notice how we move right as we move from the score total, to the score per query term, to the score per query document field (a document field is not shown if a term was not found in that field). Next, studying a particular field's scoring, it comprises a query component and then a field component. The query component includes the query-time -- as opposed to index-time -- boost, an idf that is the same for the query and field components, and then a queryNorm. Similar for the field component (fieldNorm is an aggregation of certain of the Lucene equation components). How can I influence Nutch scoring? + Scoring is implemented as a filter plugin, i.e. an implementation of the !ScoringFilter class.
By default, [http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/scoring/opic/OPICScoringFilter.html OPICScoringFilter] is used. + - The easiest way to influence scoring is to change query time boosts (Will require edit of nutch-site.xml and redeploy of the WAR file). Query-time boost by default looks like this:{{{ + However, the easiest way to influence scoring is to change query-time boosts (this requires editing nutch-site.xml and redeploying the WAR file). Query-time boost by default looks like this:{{{ query.url.boost, 4.0f query.anchor.boost, 2.0f query.title.boost, 1.5f
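The prose walk-through of Lucene scoring above can be condensed into the classic Lucene practical scoring function; this is a sketch of the factoring given in the Similarity javadoc linked above, with coord being the factor weighting how many of the query's terms a document matches:

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)
```

Here norm(t,d) rolls up the index-time boost and lengthNorm factors described in the FAQ answer; consult the javadoc for the exact grouping.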
[Nutch Wiki] Update of FAQ by DanielNaber
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DanielNaber: http://wiki.apache.org/nutch/FAQ The comment on the change is: small cleanup -- Are there any mailing lists available? - There's a user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html#Agents . + There's a user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html. - - Is there a mail archive? - - Yes: http://www.mail-archive.com/nutch-user%40lucene.apache.org/maillist.html or http://www.nabble.com/Nutch-f362.html . How can I stop Nutch from crawling my site? - Please visit öur [http://lucene.apache.org/nutch/bot.html webmaster info page] + Please visit our [http://lucene.apache.org/nutch/bot.html webmaster info page] Will Nutch be a distributed, P2P-based search engine? @@ -29, +25 @@ Will Nutch use a distributed crawler, like Grub? - Distributed crawling can save download bandwidth, but, in the long run, the savings is not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a search engine is not crawling, but searching. + Distributed crawling can save download bandwidth, but, in the long run, the savings is not significant. A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages, so making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching. Won't open source just make it easier for sites to manipulate rankings? @@ -58, +54 @@ nutch-site.xml is where you make the changes that override the default settings. 
The same goes for the servlet container application. - My system does not find the segments folder. Why? OR How do I tell the ''Nutch Servlet'' where the index file are located? + My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index files are located? There are at least two choices to do that: @@ -95, +91 @@ What happens if I inject urls several times? - Urls, which are already in the database, won't be injected. + Urls which are already in the database won't be injected. === Fetching ===
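As a sketch of the override mechanism the FAQ answer above describes, a single property block in nutch-site.xml shadows the property of the same name in nutch-default.xml (the agent name value here is a placeholder):

```xml
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```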
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: re-writing document as per latest v0.5 patch -- 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and use its authentication features. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'httpclient-auth.xml' to enable 'protocol-httpclient' and use its authentication features. 
The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. == Download == - Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. + Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch]. == Configuration == - This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits. + Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' are very brief, this section explains it in more detail. The section starts with a few very simple examples which should suffice for most real-life situations. Complex cases are described later in this article. The root element is auth-configuration for all the examples below; it has been omitted for the sake of clarity. + + === Crawling an intranet with default authentication scope === + Let's say all pages of an intranet are protected by basic, digest or ntlm authentication and there is only one set of credentials to be used for all web pages in the intranet; then a configuration as described below is enough. This is also the simplest configuration possible for authentication schemes.
+
+ {{{<credentials username="susam" password="masus">
+   <default/>
+ </credentials>}}}
+
+ The credentials specified above would be sent to any page requesting authentication. Though it is extremely simple, the default authentication scope should be used with caution: this set of credentials would be sent to any web page requesting authentication, and therefore a malicious user could steal the credentials used in the configuration by setting up a web page requiring Basic authentication. Therefore, we usually use credentials set apart for crawling only, so that even if a user steals them, he wouldn't be able to do anything harmful. If you are sure that all pages in the intranet use a particular authentication scheme, say, NTLM, then this situation can be improved a little in this manner.
+
+ {{{<credentials username="susam" password="masus">
+   <default scheme="ntlm"/>
+ </credentials>}}}
+
+ Thus, this set of credentials would be sent to pages
[Nutch Wiki] Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl -- {{{ #!/bin/sh + # runbot script to run the Nutch bot for crawling and re-crawling. + # Usage: bin/runbot [safe] + # If executed in 'safe' mode, it doesn't delete the temporary + # directories generated during the crawl. This might be helpful for + # analysis and recovery in case a crawl fails. + # # Author: Susam Pal - # - # 'runbot' script to crawl and re-crawl using Nutch 0.9 and Nutch 1.0 - # - # Modify the values of the variables in the beginning to alter the - # behaviour of the script. The script accepts only one argument 'safe' - # to run the script in safe mode. e.g. bin/runbot safe - # Safe mode prevents deletion of temporary directories so that recovery - # action can be taken if anything goes wrong during the crawl. depth=2 threads=5
[Nutch Wiki] Update of FrontPage by peterpuwang
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by peterpuwang: http://wiki.apache.org/nutch/FrontPage -- == Nutch Administration == * DownloadingNutch * HardwareRequirements - * [http://peterpuwang.googlepages.com/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. + * '''[http://peterpuwang.googlepages.com/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9.''' * [http://lucene.apache.org/nutch/tutorial.html Tutorial] -- A Step-by-Step guide to getting Nutch up and running. * NutchTutorial ''on the wiki'' * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts)
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: removed conf/nutch-site.xml conf -- == Download == Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. + == Configuration == + This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits. - == Common Credentials Configuration == - This is the simplest possible configuration, which involves setting just one set of credentials. It is useful in trusted intranets where all sites require the same username/password for authentication. - - === Quick Guide === - 1. Include 'protocol-httpclient' in 'plugin.includes'. - 1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'.
'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - - This is explained in detail in the following section. - - === Details === - To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include some properties, which are explained in this section. First and foremost, to enable the plugin, it must be added to the 'plugin.includes' of 'nutch-site.xml'. So, this property would typically look like:- - - {{{<property> - <name>plugin.includes</name> - <value>protocol-httpclient|urlfilter-regex|...</value> - <description>...</description> - </property>}}} - - (... indicates a long line truncated) - - Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml'. - - * http.proxy.username - * http.proxy.password - * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.) - * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.) - - If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'. - - * http.auth.username - * http.auth.password - * http.auth.realm - * http.auth.host - - The explanation for these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used for proxy NTLM authentication as well as web server NTLM authentication. Since the host from which the HTTP requests originate is the same for both, the same property is used for both and two different properties were not created.
- - Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases because, in case the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page. - - == Authentication Scope Specific Credentials == - This is an advanced feature that lets the user specify different credentials for different authentication scopes. === Quick Guide === An example of 'conf/httpclient-auth.xml' configuration is provided below: @@ -98, +57 @@ The 'realm' attribute is optional in the authscope tag and can be omitted if you want the credentials to be used for all realms on a particular web server (or all remaining realms, as shown in the Quick Guide section above). One authentication scope should not be defined twice in different authscope tags for different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag will be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for
[Nutch Wiki] Update of FrontPage by peterpuwang
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by peterpuwang: http://wiki.apache.org/nutch/FrontPage -- == Nutch Administration == * DownloadingNutch * HardwareRequirements - * [http://www.thechristianlife.com/z/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. + * [http://thechristianlife.re-invent.net/z/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. * [http://lucene.apache.org/nutch/tutorial.html Tutorial] -- A Step-by-Step guide to getting Nutch up and running. * NutchTutorial ''on the wiki'' * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts)
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: typo fixes -- Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, because if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page. == Authentication Scope Specific Credentials == - This is an advanced feature that lets the user specify different credentials for different authentication scopes. After that you might want to try this out and appreciate the advantages. + This is an advanced feature that lets the user specify different credentials for different authentication scopes. === Quick Guide === An example of 'conf/httpclient-auth.xml' configuration is provided below: @@ -96, +96 @@ If a page, say, 'http://192.168.101.34/index.jsp', requires authentication, then the common credentials would be used since there is no credential defined for this scope. - The 'realm' attribute is optional in authscope tag and it can be omitted if you want the credentials to be used for all realms on a particular web-server (or all remaining realms as shown in the Quick Guide section above). One authentication scope should not be defined twice as different authscope tags for different credentials tag. However, if this is done by mistake, The credentials for the last defined authscope tag would be used. This is because, the XML parsing code, reads the file from top to bottom and sets the credentials for authentication-scopes. If the same authentication scope is encountered once again, it will be overwritten with the new credentials. However, one should not rely on this behavior as this might change with further developments.
+ The 'realm' attribute is optional in the authscope tag and can be omitted if you want the credentials to be used for all realms on a particular web server (or all remaining realms, as shown in the Quick Guide section above). One authentication scope should not be defined twice in different authscope tags under different credentials tags. However, if this is done by mistake, the credentials for the last defined authscope tag would be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for authentication scopes. If the same authentication scope is encountered again, it is overwritten with the new credentials. However, one should not rely on this behavior as it might change with further development. == Underlying HttpClient Library ==
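For reference, a 'conf/httpclient-auth.xml' along the lines described might look like the sketch below. The element and attribute names follow the tags mentioned in the text (credentials, authscope, realm); treat the exact shape as an assumption and check the page's own Quick Guide example for the authoritative format. Hosts, ports, and credentials are placeholders:

```xml
<!-- Hypothetical conf/httpclient-auth.xml sketch -->
<auth-configuration>
  <credentials username="crawluser" password="secret">
    <!-- No realm attribute: used for all realms on this server -->
    <authscope host="192.168.101.33" port="80"/>
    <!-- Realm given: used only for this realm on this server -->
    <authscope host="192.168.101.77" port="8080" realm="intranet"/>
  </credentials>
</auth-configuration>
```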
[Nutch Wiki] Update of FAQ by robotgenius
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by robotgenius: http://wiki.apache.org/nutch/FAQ -- Please have a look at PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively. + + Alternatively, you can set db.ignore.external.links to true, and inject seeds from the domains you wish to crawl (these seeds must link to all pages you wish to crawl, directly or indirectly). Doing this keeps the crawl within these domains, without following external links. Unfortunately there is no way to record external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log. How can I recover an aborted fetch process?
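The db.ignore.external.links setting mentioned above is an ordinary entry in 'conf/nutch-site.xml'; a sketch (the description text here is illustrative, not quoted from nutch-default.xml):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to hosts outside the
  injected domains are ignored, so the crawl stays within those
  domains.</description>
</property>
```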
[Nutch Wiki] Trivial Update of GettingNutchRunningWithDebian by Thomas R Bailey
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Thomas R Bailey: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: File name in WEB-INF/classes directory requires modifying the nutch-default.xml -- #cd site2; jar xvf nutch-0.8.1.war; rm nutch-0.8.1.war; cd .. }}} === Configure the site1,site2 webapps === - Edit the site1/WEB-INF/classes/nutch-site.xml file for the searcher.dir parameter, so that it points back to your crawl directory under /usr/local/nutch:[[BR]] + Edit the site1/WEB-INF/classes/nutch-default.xml file for the searcher.dir parameter, so that it points back to your crawl directory under /usr/local/nutch and save it as nutch-site.xml after making the following changes:[[BR]] {{{<name>searcher.dir</name> <value>/usr/local/nutch/crawls/site1</value> }}}
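Written out as a complete property element, the searcher.dir setting above would look like this (the description text is illustrative):

```xml
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch/crawls/site1</value>
  <description>Crawl directory that this search webapp serves.
  </description>
</property>
```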
[Nutch Wiki] Trivial Update of GettingNutchRunningWithDebian by Thomas R Bailey
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Thomas R Bailey: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: removed extra '' in Catalina configuration -- Under Debian Etch, the Catalina configuration files are located under '''/etc/tomcat5.5/policy.d'''. At runtime they are combined into a single file, ''/usr/share/tomcat5.5/conf/catalina.policy''. Do not edit the latter, as it will be overwritten.[[BR]] At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:[[BR]] - {{{grant codeBase file:/usr/share/tomcat5.5-webapps/-\'' { + {{{grant codeBase file:/usr/share/tomcat5.5-webapps/-\ { permission java.util.PropertyPermission user.dir, read; permission java.util.PropertyPermission java.io.tmpdir, read,write; permission java.util.PropertyPermission org.apache.*, read,execute;
[Nutch Wiki] Update of FrontPage by peterpuwang
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by peterpuwang: http://wiki.apache.org/nutch/FrontPage -- == Nutch Administration == * DownloadingNutch * HardwareRequirements + * [http://www.thechristianlife.com/z/NutchGuideForDummies.htm Tutorial] -- Latest step by Step Installation guide for dummies: Nutch 0.9. * [http://lucene.apache.org/nutch/tutorial.html Tutorial] -- A Step-by-Step guide to getting Nutch up and running. * NutchTutorial ''on the wiki'' * [Nutch - The Java Search Engine] (Builds on the basic tutorials. Includes index maintenance scripts)
[Nutch Wiki] Update of WritingPluginExample-0.9 by JasperKamperman
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JasperKamperman: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 The comment on the change is: Added explanation why the field is added as UN_TOKENIZED -- == The Example == - Consider this as a plugin example: We want to be able to recommend specific web pages for given search terms. For this example we'll assume we're indexing this site. As you may have noticed, there are a number of pages that talk about plugins. What we want to do is have it so that if someone searches for the term plugin we recommend that they start at the PluginCentral page, but we also want to return all the normal hits in the expected ranking. We'll separate the search results page into a section of recommendations and then a section with the normal search results. + Consider this as a plugin example: We want to be able to recommend specific web pages for given search terms. For this example we'll assume we're indexing this site. As you may have noticed, there are a number of pages that talk about plugins. What we want to do is have it so that if someone searches for the term plugins we recommend that they start at the PluginCentral page, but we also want to return all the normal hits in the expected ranking. We'll separate the search results page into a section of recommendations and then a section with the normal search results. You go through your site and add meta-tags to pages that list what terms they should be recommended for. The tags look something like this: @@ -177, +177 @@ == The Indexer Extension == - The following is the code for the Indexing Filter extension. If the document being indexed had a recommended meta tag this extension adds a lucene text field to the index called recommended with the content of that meta tag.
Create a file called RecommendedIndexer.java in the source code directory: + The following is the code for the Indexing Filter extension. If the document being indexed has a recommended meta tag, this extension adds a Lucene text field called recommended to the index with the content of that meta tag. Create a file called RecommendedIndexer.java in the source code directory: {{{ package org.apache.nutch.parse.recommended; @@ -242, +242 @@ } } }}} + + Note that the field is UN_TOKENIZED because we don't want the recommended tag to be cut up by a tokenizer. Change it to TOKENIZED if you want to be able to search on parts of the tag, for example to put multiple recommended terms in one tag. == The QueryFilter ==
[Nutch Wiki] Trivial Update of NutchTutorial by JoeyMazzarelli
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JoeyMazzarelli: http://wiki.apache.org/nutch/NutchTutorial The comment on the change is: current path to DmozParser -- Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5000, so that we end up with around 1000 URLs: {{{ mkdir dmoz - bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 dmoz/urls }}} + bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 dmoz/urls }}} The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl db with the selected urls.
[Nutch Wiki] Update of HttpAuthenticationSchemes by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/HttpAuthenticationSchemes The comment on the change is: Initial draft copied from protocol-http11 New page: == Introduction == 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers. == Author of Authentication Features == Susam Pal, Infosys Technologies Limited == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were written to solve these problems, provide additional features like authentication support for the proxy server, and give better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and use its authentication features. This is an improvement on the previous two plugins. The author of the authentication features has tested them at Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication, and this has been found to work well. == Download == Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk. == Quick Guide == This section is a quick guide to configuring authentication-related properties for 'protocol-httpclient'. 1. Include 'protocol-httpclient' in 'plugin.includes'. 1. For basic or digest authentication in the proxy server, set 'http.proxy.username' and 'http.proxy.password'.
Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope. 1. For NTLM authentication in the proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope. 1. For NTLM authentication in web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. 1. It is recommended that 'http.useHttp11' be set to true. This is explained in a little more detail in the next section. == Nutch Configuration == To use 'protocol-httpclient', 'conf/nutch-site.xml' has to be edited to include some properties which are explained in this section. First and foremost, to enable the plugin, it must be added to the 'plugin.includes' property of 'nutch-site.xml'. So, this property would typically look like: {{{<property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|...</value> <description>...</description> </property>}}} (... indicates a long line truncated) Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml'. * http.proxy.username * http.proxy.password * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.) * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.) If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'.
* http.auth.username * http.auth.password * http.auth.realm * http.auth.host The explanation for these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and web server NTLM authentication. Since the host from which the HTTP requests originate is the same in both cases, the same property is used for both and two separate properties were not created. Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, because if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), it can still fetch the page. == Underlying HttpClient Library == 'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: content moved to HttpAuthenticationSchemes -- + protocol-http11 has been converted to a patch for protocol-httpclient as per the discussion held at [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557]. - == Introduction == - 'protocol-http11' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. + Therefore, the content of this page has been moved to HttpAuthenticationSchemes. - == Author == - Susam Pal, Infosys Technologies Limited - == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. The name, 'protocol-http11' was chosen because, 'HTTP 1.1' is a valid protocol name. - - == Download == - Currently, this plugin is in the form of patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk. 
- - == Quick Guide == - This section is a quick guide to configure authentication related properties for 'protocol-http11'. - - 1. Include 'protocol-http11' in 'plugin.includes'. - 1. For basic or digest authentication in proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - 1. For basic or digest authentication in web servers, set 'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if you want to specify a realm as the authentication scope. - 1. For NTLM authentication in proxy server, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running. - 1. It is recommended that 'http.useHttp11' be set to true. - - This is explained in a little more detail in the next section. - - == Nutch Configuration == - To use 'protocol-http11', 'conf/nutch-site.xml has to be edited to include some properties which is explained in this section. First and foremost, to enable the plugin, this plugin must be added in the 'plugin.includes' of 'nutch-site.xml'. So, this property would typically look like:- - - {{{<property> - <name>plugin.includes</name> - <value>protocol-http11|urlfilter-regex|...</value> - <description>...</description> - </property>}}} - - (... indicates truncation) - - It is recommended that HTTP 1.1 should be enabled. - - {{{<property> - <name>http.useHttp11</name> - <value>true</value> - <description>...</description> - </property>}}} - - Next, if authentication is required for proxy server, the following properties need to be set in 'conf/nutch-site.xml'.
- - * http.proxy.username - * http.proxy.password - * http.proxy.realm (If a realm needs to be provided. In case of NTLM authentication, the domain name should be provided as its value.) - * http.auth.host (This is required in case of NTLM authentication only. This is the host where the crawler would be running.) - - If the web servers of the intranet are in a particular domain or realm and requires authentication, these properties should be set in 'conf/nutch-site.xml'. - - * http.auth.username - * http.auth.password - * http.auth.realm - * http.auth.host - - The explanation for these properties are similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and web server NTLM authentication. Since, the host at which the HTTP requests are originating are same for both, so the same property is used for both and two different
[Nutch Wiki] Update of FrontPage by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/FrontPage The comment on the change is: Http Authentication Schemes -- * CrossPlatformNutchScripts * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's progress. * [Nutch 0.9 Crawl Script Tutorial] + * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes. == Nutch Development == * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start developing and contributing to Nutch.
[Nutch Wiki] Update of Help Wanted by GordonMohr
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by GordonMohr: http://wiki.apache.org/nutch/Help_Wanted -- This listing is provided as a reference only. No endorsements are given or implied. - * The [http://www.archive.org Internet Archive] is seeking a [http://www.archive.org/about/webjobs.php#JavaSoftwareEngineer Java Software Engineer] (and/or a possible Indexing Operations Engineer) to contribute to [http://archive-access.sourceforge.net/projects/nutch NutchWAX] (the adaptation of Nutch for web archives), help make our ever-growing collections available for full-text search, and related projects. + * The [http://www.archive.org Internet Archive] is seeking a [http://www.archive.org/about/webjobs.php#SeniorSearchEngineer Senior Search Engineer] (and/or a possible Indexing Operations Engineer) to lead the development of our open source search tools and platforms.
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: corrected XML code for enabling HTTP 1.1 -- {{{<property> <name>http.useHttp11</name> - <value>false</value> + <value>true</value> <description>...</description> </property>}}}
[Nutch Wiki] Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl The comment on the change is: fixed topN bug -- == Script == {{{ #!/bin/sh - - # Runs the Nutch bot to crawl or re-crawl - # Usage: bin/runbot [safe] - #If executed in 'safe' mode, it doesn't delete the temporary - #directories generated during crawl. This might be helpful for - #analysis and recovery in case a crawl fails. - # - # Author: Susam Pal - - depth=2 + depth=8 threads=50 adddays=5 - topN=2 # Comment this statement if you don't want to set topN value + topN=1000 #Comment this statement if you don't want to set topN value # Parse arguments if [ $1 == safe ] @@ -101, +92 @@ if [ -n $topN ] then - topN=--topN $rank + topN=-topN $topN else topN= fi @@ -125, +116 @@ $NUTCH_HOME/bin/nutch fetch $segment -threads $threads if [ $? -ne 0 ] then - echo runbot: fetch $segment at depth $depth failed. Deleting segment $segment. + echo runbot: fetch $segment at depth `expr $i + 1` failed. Deleting segment $segment. rm -rf $segment continue fi @@ -138, +129 @@ $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* if [ $safe != yes ] then - rm -rf crawl/segments/* + rm -rf crawl/segments else - mkdir crawl/FETCHEDsegments - mv --verbose crawl/segments/* crawl/FETCHEDsegments + mv $MVARGS crawl/segments crawl/FETCHEDsegments fi - mv --verbose crawl/MERGEDsegments/* crawl/segments + mv $MVARGS crawl/MERGEDsegments crawl/segments - rmdir crawl/MERGEDsegments echo - Invert Links (Step 4 of $steps) - $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
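The topN fix in the script above is easy to misread in diff form. As a standalone sketch (same variable names as the script), the corrected option handling expands $topN into a "-topN <value>" option only when it is set:

```shell
#!/bin/sh
# Corrected -topN handling from the runbot script: the original bug
# built "--topN $rank" (doubled dash, wrong variable); the fix builds
# "-topN $topN" from the value set at the top of the script.
topN=1000    # comment this assignment out to skip the -topN option

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

# $topN is later appended to the generate command line.
printf '%s\n' "$topN"    # prints: -topN 1000
```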
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: why the name is protocol-http11 -- Susam Pal, Infosys Technologies Limited == Necessity == - There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. 
The name 'protocol-http11' was chosen because 'HTTP 1.1' is a valid protocol name. == Download == Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk.
[Nutch Wiki] Trivial Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: use pronoun 'it' for 'HttpClient' -- Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, because if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), the crawler can still fetch the page. == Underlying HttpClient Library == - 'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, HttpClient must choose which scheme to use. To accomplish this, HttpClient uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide]. + 'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, it must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide]. == Need Help? ==
If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list].
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: download -- == Necessity == There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well. + + == Download == + Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk. == Quick Guide == This section is a quick guide to configuring authentication-related properties for 'protocol-http11'.
[Nutch Wiki] Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: rough draft New page:

== Introduction ==

'protocol-http11' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with the Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers.

== Author ==

Susam Pal, Infosys Technologies Limited

== Necessity ==

Two protocol plugins were already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems and to provide additional features, such as authentication support for proxy servers and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use its authentication features. It is an improvement on the previous two plugins. The author has tested it at Infosys Technologies Limited by crawling a corporate intranet requiring NTLM authentication, and it has been found to work well.

== Quick Guide ==

This section is a quick guide to configuring the authentication-related properties for 'protocol-http11'.

 1. Include 'protocol-http11' in 'plugin.includes'.
 1. For Basic or Digest authentication with a proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also set 'http.proxy.realm' if you want to specify a realm as the authentication scope.
 1. For NTLM authentication with a proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name; 'http.auth.host' is the host on which the crawler is running.
 1. For Basic or Digest authentication with web servers, set 'http.auth.username' and 'http.auth.password'. Also set 'http.auth.realm' if you want to specify a realm as the authentication scope.
 1. For NTLM authentication with web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name; 'http.auth.host' is the host on which the crawler is running.
 1. It is recommended that 'http.useHttp11' be set to true. This is explained in a little more detail in the next section.

== Nutch Configuration ==

To use 'protocol-http11', 'conf/nutch-site.xml' has to be edited to include some properties, which are explained in this section. First and foremost, the plugin must be added to 'plugin.includes' in 'nutch-site.xml', so this property would typically look like:

{{{<property>
  <name>plugin.includes</name>
  <value>protocol-http11|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>}}}

It is recommended that HTTP 1.1 be enabled:

{{{<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>...</description>
</property>}}}

Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml':

 * http.proxy.username
 * http.proxy.password
 * http.proxy.realm (if a realm needs to be provided; in the case of NTLM authentication, the domain name should be given as its value)
 * http.auth.host (required for NTLM authentication only; this is the host on which the crawler runs)

If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'.
 * http.auth.username
 * http.auth.password
 * http.auth.realm
 * http.auth.host

The explanation of these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used both for proxy NTLM authentication and for web server NTLM authentication. Since the host from which the HTTP requests originate is the same in both cases, a single property is used instead of two separate ones. Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, so that if the crawler comes across a server requiring NTLM authentication (which you might not have anticipated), it can still fetch the page.

== Underlying HttpClient Library ==

'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for
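Putting the properties above together, a minimal 'nutch-site.xml' fragment for NTLM authentication against intranet web servers might look like the following sketch. All values shown (the DOMAIN realm, the crawler host name, and the credentials) are illustrative placeholders, not values from this page:

{{{<!-- Sketch: NTLM authentication against intranet web servers (placeholder values) -->
<property>
  <name>http.auth.username</name>
  <value>crawler_user</value>
  <description>Username for web server authentication (placeholder).</description>
</property>
<property>
  <name>http.auth.password</name>
  <value>secret</value>
  <description>Password for web server authentication (placeholder).</description>
</property>
<property>
  <name>http.auth.realm</name>
  <value>DOMAIN</value>
  <description>For NTLM, the NTLM domain name (placeholder).</description>
</property>
<property>
  <name>http.auth.host</name>
  <value>crawler.example.com</value>
  <description>Host on which the crawler runs (placeholder).</description>
</property>}}}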
[Nutch Wiki] Trivial Update of protocol-http11 by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/protocol-http11 The comment on the change is: truncated long line --

 {{{<property>
 <name>plugin.includes</name>
- <value>protocol-http11|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ <value>protocol-http11|urlfilter-regex|...</value>
 <description>...</description>
 </property>}}}
+
+ (... indicates truncation)

 It is recommended that HTTP 1.1 should be enabled.
[Nutch Wiki] Update of PluginCentral by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/PluginCentral The comment on the change is: protocol-http11 -- * GeoPosition * [German] * [http://issues.apache.org/jira/browse/NUTCH-422 index-extra] - Adds user-configurable fields to the index. - * [http://issues.apache.org/jira/browse/NUTCH-427 protocol-smb] - Allows Nutch to crawl MS Windows Shares folder + * [http://issues.apache.org/jira/browse/NUTCH-427 protocol-smb] - Allows Nutch to crawl MS Windows Shares folder. - + * [protocol-http11] - Adds support for HTTP 1.1, HTTPS, Basic, Digest and NTLM authentication. ([https://issues.apache.org/jira/browse/NUTCH-557 NUTCH-557])
[Nutch Wiki] Update of Nutch 0.9 Crawl Script Tutorial by Lyndon
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Lyndon: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial -- echo - Merge Segments (Step 4 of $steps) - $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* - if [ $? -ne 0 ] + if [ $? -eq 0 ] then if [ $safe != yes ] then
[Nutch Wiki] Trivial Update of Nutch 0.9 Crawl Script Tutorial by Lyndon
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Lyndon: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial The comment on the change is: the mergesegs failure test was inverted, needed to test $? -eq instead of -ne -- echo - Merge Segments (Step 4 of $steps) - $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* + + /!\ '''Edit conflict - other version:''' if [ $? -eq 0 ] + + /!\ '''Edit conflict - your version:''' + if [ $? -eq 0 ] + + /!\ '''End of edit conflict''' then if [ $safe != yes ] then
[Nutch Wiki] Trivial Update of Nutch 0.9 Crawl Script Tutorial by Lyndon
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Lyndon: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial The comment on the change is: removing wiki inserted conflict text sorry -- echo - Merge Segments (Step 4 of $steps) - $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* - /!\ '''Edit conflict - other version:''' if [ $? -eq 0 ] - - /!\ '''Edit conflict - your version:''' - if [ $? -eq 0 ] - - /!\ '''End of edit conflict''' then if [ $safe != yes ] then
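The inverted test that these three edits sort out reduces to a small sketch. The stub below merely stands in for `bin/nutch mergesegs`; the point is that cleanup of the temporary segments should happen only on a zero (success) exit status, i.e. `$? -eq 0`, not `-ne`:

```shell
#!/bin/sh
# merge is a stub standing in for:
#   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
merge() {
  return 0   # pretend the merge succeeded
}

merge
if [ $? -eq 0 ]   # correct: exit status 0 means success, safe to clean up
then
  echo "merge succeeded - removing old segments"
else
  echo "merge failed - keeping segments for recovery"
fi
```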
[Nutch Wiki] Update of Globe+Correspondent by wikicninfo
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by wikicninfo: http://wiki.apache.org/nutch/Globe+Correspondent New page: But community health centers draw patients for a number of reasons. They offer one-stop shopping, which can include dental care, substance abuse treatment, pediatric and prenatal care, and social services. Most have child care and translators on site for non-English speakers. With the new Massachusetts health insurance law boosting the number of patients seeking care, community health centers south of Boston are scrambling to meet the demand. Manet Community Health Center, which has four locations in Quincy and one in Hull, is hiring two new family care physicians and a nurse practitioner. Brockton Neighborhood Health Center now stays open two hours later on weeknights. In February, it hired a nurse practitioner, two medical assistants, and two social workers, and is planning to hire 20 more staff members in the next six months. "We've seen a really significant increase in visits by new patients," said Sue Joss, executive director of the Brockton health center. "Our phones are ringing off the hook for new patients." The two centers are the only ones directly south of Boston. But community health centers in Fall River and New Bedford, which also serve people from this region, are experiencing the same increase in demand, and expanding hours to meet it. The state's universal health insurance law, which is being rolled out this year, is bringing formerly uninsured people into the healthcare system. Many of these individuals and families are turning to community health centers, the locally based nonprofit organizations that arose from the antipoverty movement of the 1960s. 
"We are front and center in the new healthcare legislation," said Kerin O'Toole, spokeswoman for the Massachusetts League of Community Health Centers. "We've seen quite a surge in demand." Although in many cases patients could go elsewhere, the health centers offer a whole range of services you can't get from a private provider. The nation's first community health center opened at Columbia Point in Dorchester in 1965 as part of President Johnson's war on poverty. Similar centers, supported by federal aid and private grants, opened across the country in poor and medically underserved areas. Today, the United States has more than a thousand centers, 52 of them in Massachusetts. Business is thriving. In April, the Brockton center on Main Street saw a 12 percent spike in patient visits over last year, and in May, a 9 percent increase. A new $16 million center is under construction next to the cramped downtown facility and is scheduled to open in November. Statewide, patient loads at community health centers have been on the rise. In 2006, [http://www.teamflyelectronic.com/ Burglar alarm] centers in Massachusetts saw 760,301 patients, an increase of nearly 94,000, or 14 percent, over the previous year. The surge in demand at community health centers with the new law was not fully expected. The centers have long been a safety net in the healthcare system - places where people could go whether they had insurance or not. The insured usually have many choices when seeking care. "People are more aware of the community health centers and the services we provide," said Sheryl Turgeon, chief executive officer of Healthfirst, which draws patients from Fall River and nearby towns. 
Community health centers also do outreach for Commonwealth Care, the new state health insurance program, and visitors to most centers can sign up for health insurance on the spot. The heavy promotions the state has been doing to get the uninsured to sign up and take advantage of healthcare also seem to be a factor in the increasing number of visits, according to Toni McGuire, chief executive officer of the Manet center. "I think one of the biggest reasons for the increase is the advertising around Commonwealth Care," McGuire said. Said Joss of the Brockton center, "There was never this kind of publicity around the free-care pool." In the past, institutions that treated the uninsured were compensated by a pool of money administered by the state and paid into by hospitals and other large providers. Another reason that community health centers are seeing more patients is that three of the four insurers working with Commonwealth Care tend to direct subscribers to the centers, according to Alan Sager, director of the health reform program at Boston University School of Public Health. Sager said he is concerned that some community health centers may not be able to hire physicians quickly enough to meet the demand. If health centers were deluged by
[Nutch Wiki] Update of wow+power+leveling by loki002
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by loki002: http://wiki.apache.org/nutch/wow+power+leveling New page: Baseball gave him his earliest challenge. He was an outstanding pitcher in Little League, and eventually, as a senior in high school, made the [http://www.toppowerlevel.net wow powerleveling] varsity, winning half the team's games with a record of five wins and two losses. At graduation, the coach named Daniel the [http://www.toppowerlevel.net wow power level] team's most valuable player.

His finest hour, though, came at a school science fair. He entered an exhibit showing how the [http://www.toppowerlevel.net wow power leveling] circulatory system works. It was primitive and crude, especially compared to the fancy, computerized, blinking-light models entered by other [http://www.toppowerlevel.net wow power level] students. My wife, Sara, felt embarrassed for him.

It turned out that the other kids [http://www.toppowerlevel.net wow power leveling] had not done their own work; their parents had made their exhibits. As the judges went on their rounds, they found that these other kids couldn't answer their questions. Daniel answered every one. When the judges awarded the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] Albert Einstein Plaque for the best exhibit, they gave it to him.

By the time Daniel left for [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] he stood six feet tall and weighed 170 pounds. He was muscular and in superb [http://www.toppowerlevel.net wow powerleveling] condition, but he never pitched another inning, having given up baseball for English literature. I was sorry that he would not develop his athletic talent, but proud that he had made such a mature decision. 
One day I told Daniel that the great failing in my [http://www.toppowerlevel.net wow power leveling] life had been that I didn't take a year or two off to travel when I finished college. This is the best way, to my way of thinking, to broaden oneself and develop a larger perspective on life. Once I had married and begun working, I found that the dream of [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] in another culture had vanished.

Daniel thought about this. His friends said that he would be insane to put his career on [http://www.toppowerlevel.net wow powerleveling]. But he decided it wasn't so crazy. After graduation, he worked as a waiter at college, a bike messenger and a house painter. With the money he earned, he had enough to go to [http://www.toppowerlevel.net wow power level] Paris.

The [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] before he was to leave, I tossed in bed. I was trying to figure out something to say. Nothing came to mind. Maybe, I thought [http://www.toppowerlevel.net wow power leveling], it wasn't necessary to say anything.

What does it matter in the course of a [http://www.toppowerlevel.net wow power level] if a father never tells a son what he really thinks of him? But as I stood before Daniel, I knew that it does matter. My father and I loved each other. Yet, I always regretted never hearing him put his feelings into words and never having the memory of that moment. Now, I could feel my palms sweat and my throat tighten. Why is it so hard to tell a son something from the heart? My mouth turned dry, and I knew I would be able to get out only a few [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] words clearly.

"Daniel," I said, "if I could have picked [http://www.toppowerlevel.net wow powerleveling], I would have picked you."

That's all I could say. I wasn't sure he understood what I meant. Then he came toward me and threw his arms around me. 
For a moment, the [http://www.toppowerlevel.net wow power leveling] world and all its people vanished, and there was just Daniel and me in our home by the sea.

He was saying [http://www.toppowerlevel.net wow powerleveling], but my eyes misted over, and I couldn't understand what he was saying. All I was aware of was the stubble on his chin as his face pressed against mine. And then, the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling]. I went to [http://www.toppowerlevel.net wow power level] work, and Daniel left a few hours later with his girlfriend.

That was seven weeks ago, and I think about [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling] when I walk along the beach on weekends. Thousands of miles away, somewhere out past the ocean waves breaking on the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power level] deserted shore, he might be scurrying across Boulevard Saint Germain, strolling
[Nutch Wiki] Update of wow+power+leveling by matthieuriou
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by matthieuriou: http://wiki.apache.org/nutch/wow+power+leveling The comment on the change is: Spam attack --

- Baseball gave him his earliest challenge. He was an outstanding pitcher in Little League, and eventually, as a senior in high school, made the [http://www.toppowerlevel.net wow powerleveling] varsity, winning half the team's games with a record of five wins and two losses. At graduation, the coach named Daniel the [http://www.toppowerlevel.net wow power level] team's most valuable player.
+ deleted
- His finest hour, though, came at a school science fair. He entered an exhibit showing how the [http://www.toppowerlevel.net wow power leveling] circulatory system works. It was primitive and crude, especially compared to the fancy, computerized, blinking-light models entered by other [http://www.toppowerlevel.net wow power level] students. My wife, Sara, felt embarrassed for him.
-
- It turned out that the other kids [http://www.toppowerlevel.net wow power leveling] had not done their own work; their parents had made their exhibits. As the judges went on their rounds, they found that these other kids couldn't answer their questions. Daniel answered every one. When the judges awarded the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] Albert Einstein Plaque for the best exhibit, they gave it to him.
-
- By the time Daniel left for [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] he stood six feet tall and weighed 170 pounds. He was muscular and in superb [http://www.toppowerlevel.net wow powerleveling] condition, but he never pitched another inning, having given up baseball for English literature. I was sorry that he would not develop his athletic talent, but proud that he had made such a mature decision.
-
- One day I told Daniel that the great failing in my [http://www.toppowerlevel.net wow power leveling] life had been that I didn't take a year or two off to travel when I finished college. This is the best way, to my way of thinking, to broaden oneself and develop a larger perspective on life. Once I had married and begun working, I found that the dream of [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] in another culture had vanished.
-
- Daniel thought about this. His friends said that he would be insane to put his career on [http://www.toppowerlevel.net wow powerleveling]. But he decided it wasn't so crazy. After graduation, he worked as a waiter at college, a bike messenger and a house painter. With the money he earned, he had enough to go to [http://www.toppowerlevel.net wow power level] Paris.
-
- The [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] before he was to leave, I tossed in bed. I was trying to figure out something to say. Nothing came to mind. Maybe, I thought [http://www.toppowerlevel.net wow power leveling], it wasn't necessary to say anything.
-
- What does it matter in the course of a [http://www.toppowerlevel.net wow power level] if a father never tells a son what he really thinks of him? But as I stood before Daniel, I knew that it does matter. My father and I loved each other. Yet, I always regretted never hearing him put his feelings into words and never having the memory of that moment. Now, I could feel my palms sweat and my throat tighten. Why is it so hard to tell a son something from the heart? My mouth turned dry, and I knew I would be able to get out only a few [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro power leveling] words clearly.
-
- "Daniel," I said, "if I could have picked [http://www.toppowerlevel.net wow powerleveling], I would have picked you."
-
- That's all I could say. I wasn't sure he understood what I meant. Then he came toward me and threw his arms around me. For a moment, the [http://www.toppowerlevel.net wow power leveling] world and all its people vanished, and there was just Daniel and me in our home by the sea.
-
- He was saying [http://www.toppowerlevel.net wow powerleveling], but my eyes misted over, and I couldn't understand what he was saying. All I was aware of was the stubble on his chin as his face pressed against mine. And then, the [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling]. I went to [http://www.toppowerlevel.net wow power level] work, and Daniel left a few hours later with his girlfriend.
-
- That was seven weeks ago, and I think about [http://www.toppowerlevel.net/powerlist.php?fid=2871 lotro powerleveling] when I walk along the beach on weekends. Thousands of miles away, somewhere out past the ocean waves
[Nutch Wiki] Update of FrontPage by KevinBurton
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KevinBurton: http://wiki.apache.org/nutch/FrontPage -- * [Search_Theory] Search Theory White Papers * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts + * [http://spinn3r Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come).
[Nutch Wiki] Update of FrontPage by KevinBurton
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KevinBurton: http://wiki.apache.org/nutch/FrontPage -- * [Search_Theory] Search Theory White Papers * [http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E Tutorial Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06] * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts - * [http://spinn3r Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come). + * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open Source components] (our contribution to the crawling OSS community with more to come).
[Nutch Wiki] Trivial Update of GettingNutchRunningWithDebian by Ted Guild
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Ted Guild: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: webapps not necessary and perhaps not desired to have running on server -- ''export JAVA_HOME''[[BR]] == Install Tomcat5.5 and Verify that it is functioning == - ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-webapps''[[BR]] + ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin ''[[BR]] Verify Tomcat is running:[[BR]] ''# /etc/init.d/tomcat5.5 status''[[BR]] ''#Tomcat servlet engine is running with Java pid /var/lib/tomcat5.5/temp/tomcat5.5.pid''[[BR]]
[Nutch Wiki] Update of GettingNutchRunningWithDebian by Ted Guild
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by Ted Guild: http://wiki.apache.org/nutch/GettingNutchRunningWithDebian The comment on the change is: removing BR noise on conf example --

Under Debian Etch, the Catalina configuration files are located under '''/etc/tomcat5.5/policy.d''' At runtime they are combined into a single file, ''/usr/share/tomcat5.5/conf/catalina.policy'' Do not edit the latter, as it will be overwritten.[[BR]] At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:[[BR]]

- {{{grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {[[BR]]
+ {{{grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {
- permission java.util.PropertyPermission "user.dir", "read";[[BR]]
+ permission java.util.PropertyPermission "user.dir", "read";
- permission java.util.PropertyPermission "java.io.tmpdir", "read,write";[[BR]]
+ permission java.util.PropertyPermission "java.io.tmpdir", "read,write";
- permission java.util.PropertyPermission "org.apache.*", "read,execute";[[BR]]
+ permission java.util.PropertyPermission "org.apache.*", "read,execute";
- permission java.io.FilePermission "/usr/local/nutch/crawls/-", "read";[[BR]]
+ permission java.io.FilePermission "/usr/local/nutch/crawls/-", "read";
- permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";[[BR]]
+ permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";
- permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";[[BR]]
+ permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";
- permission java.lang.RuntimePermission "createClassLoader";[[BR]]
+ permission java.lang.RuntimePermission "createClassLoader";
- permission java.security.AllPermission;[[BR]]
+ permission java.security.AllPermission;
- };[[BR]]}}}
+ };}}}

'''Warning: The last line here was necessary in order to make things work for me. 
If anybody can supply a more restrictive permission set, please do so!!! The effects of this are unknown'''[[BR]] == Install Multiple Copies of Nutch under Tomcat5.5 and Prepare for Searching ==
[Nutch Wiki] Update of FrontPage by jbv
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by jbv: http://wiki.apache.org/nutch/FrontPage -- * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes * CrossPlatformNutchScripts * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's progress. + * [Nutch 0.9 Crawl Script Tutorial] == Nutch Development == * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start developing and contributing to Nutch.
[Nutch Wiki] Update of Nutch 0.9 Crawl Script Tutorial by jbv
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by jbv: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial New page:

== Nutch 0.9 Crawl Script Tutorial ==

This is a walkthrough of the Nutch 0.9 crawl.sh script provided by Susam Pal. (Thanks for getting me started, Susam!) I am only a novice at the whole Nutch thing, so this article may not be 100% accurate, but I think it will be helpful to other people just getting started, and I am hopeful that people who know more about Nutch will correct my mistakes and add more useful information to this document. Thanks to everyone in advance! (By the way, I made changes to Susam's script, so if I broke stuff or made stupid mistakes, please correct me. ;)

{{{
#!/bin/sh
# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
}}}

First we specify some variables. depth tells how many times to crawl the web pages. It seems like about 6 will get us all the files, but to be really thorough 9 should be enough. threads sets how many threads to crawl with, though ultimately this is limited by the conf file's max-threads-per-server setting, because for intranet crawling (like we are doing) there is really only one server. adddays is something I don't know... need to figure out how to use this to our advantage for only crawling updated pages. topN is not used right now because we want to crawl the whole intranet. You can use it during testing to limit the maximum number of pages to crawl per depth, but then you won't get all the possible results. 
{{{
depth=9
threads=50
adddays=5
#topN=100 # Comment this statement if you don't want to set topN value
NUTCH_HOME=/data/nutch
CATALINA_HOME=/var/lib/tomcat5.5
}}}

NUTCH_HOME and CATALINA_HOME have to be configured to point to where you installed Nutch and Tomcat respectively.

{{{
# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo "runbot: $0 could not find environment variable CATALINA_HOME"
  echo "runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi
}}}

This last part just looks at the incoming variables and sets defaults, etc... Now on to the real work!!

== Step 1 : Inject ==

First thing is to inject the crawldb with an initial set of urls to crawl. In our case we are injecting only a single url contained in the nutch/seed/urls file. But in the future this file will probably get filled with out-of-date pages in order to hasten their recrawl.

{{{
steps=10
echo "- Inject (Step 1 of $steps) -"
$NUTCH_HOME/bin/nutch inject crawl/crawldb seed
}}}

== Step 2 : Crawl ==

Next we do a for loop for $depth number of times. This for loop performs a couple of steps which make up a basic 'crawl' procedure. First it generates a segment which (I think?) is filled with empty data for each url in the crawldb that has reached its expiration (i.e. has not been fetched in a month). I am not really sure what this does yet... Then it fetches pages for those urls and stores that data in the segment. During this fetch phase, it also fills the crawldb with any new urls it finds (as long as they are not excluded by the filters we configured). 
This is really the key to making this for loop work, because the next time it gets to the segment generation there will be more urls in the crawldb for it to crawl. Notice however that the crawldb never gets cleared in this script... so if I am not mistaken there is no need to re-inject the root url. Then we parse the data in the segments. Although, depending on your configuration in the xml files, this can be done automatically, in our case we are parsing manually because I read it would be faster this way... Haven't really given it a good test yet. After these steps are done we have a nice set of segments that are full of data to be indexed.

{{{
echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`
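The control flow of the loop above (truncated in this digest) can be sketched on its own. In this hedged sketch a stub stands in for the `bin/nutch generate` call; the point is that a non-zero exit status from generate (no more URLs) is what breaks out of the depth loop early:

```shell
#!/bin/bash
# Stub standing in for "$NUTCH_HOME/bin/nutch generate ...".
# Here it pretends the crawldb runs out of URLs after 3 rounds.
generate() {
  [ "$1" -lt 3 ]
}

depth=5
for ((i = 0; i < depth; i++))
do
  if ! generate "$i"
  then
    echo "Stopping at depth $i. No more URLs to fetch."
    break
  fi
  echo "Crawled depth $i"
done
```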
[Nutch Wiki] Update of jbv by jbv
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by jbv: http://wiki.apache.org/nutch/jbv New page: Oh look its my page. I'm Jeff Van Boxtel. I am a nutch newb, but I hope to learn more and contribute to this wiki. Yoroshiku.
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Adding initial help for configuring clustering with different clustering algs. --

 * configurable parameters: Take a look at the defaults defined in nutch-default.xml (search for 'clustering').
 * meta data added to index: None. Clustering is performed dynamically for each result set.
 * required jars: The entire `lib` folder in the plugin must be present in classpath. More JARs might be needed from the Carrot2 project if additional algorithms or languages are to be used.
- * plugin extension point interface: net.nutch.clustering.OnlineClusterer
+ * plugin extension point interface: `net.nutch.clustering.OnlineClusterer`
 * '''Carrot2 JARs come from codebase in version: 2.1'''

@@ -61, +61 @@

 /property
 }}}

+ == Using other Carrot2 clustering algorithms ==
+
+ To limit the size of the clustering plugin, the default implementation is shipped with the Lingo
+ algorithm -- just one of several alternatives available in the Carrot2 project. This section describes
+ how to substitute the default algorithm with a different one.
+
+ First, prepare the following:
+
+  * Install Nutch, enable the clustering plugin and make sure it works in the default configuration.
+  * Get a precompiled distribution of Carrot2 ([http://project.carrot2.org/download.html]), for example the DCS demo, or compile it from scratch.
+
+ Now you are ready to install another clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead
+ of Lingo. We will use a binary release of the DCS as a source of the required Carrot2 JARs. It is assumed that Nutch's WAR is installed in a Web application container such as Jetty or Tomcat.
+
+ (will finish later)
+
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin -- * Install Nutch, enable the clustering plugin and make sure it works in the default configuration. * Get a precompiled distribution of Carrot2 ([http://project.carrot2.org/download.html]), for example the DCS demo, or compile it from scratch. - Now you are ready to install another clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead + Now you are ready to install a different clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead - of Lingo. We will use a binary release of the DCS as a source of the required Carrot2 JARs. It is assumed that Nutch's WAR is installed in a Web application container such as Jetty or Tomcat. + of Lingo on the Jetty server (6.1.5). We will use a binary release of the DCS as a source of the required Carrot2 JARs. - (will finish later) + * Make a symbolic link from `webapps/ROOT.war` to a compiled Nutch WAR file (or just place it there). + * Make sure clustering and search work as expected. + * Download the binary release of DCS (we will use the one from the latest stable release: version 2.1). + * The Carrot2 framework has the notion of a process -- a pipeline of components that process search results and emit clusters. We will need to provide the name of an XML file which defines such a process to Nutch's clustering extension and give it access to all the required classes it may need. Let's start by defining a process. Unpack the DCS distribution and locate the `descriptors` folder. You'll see a bunch of files inside; the one that interests us is called `alg-stc-en.xml`. 
Its contents should look like this:
+ {{{
+ <local-process id="stc-en">
+   <name>STC (+English)</name>
+   <description>Suffix Tree Clustering Algorithm</description>
+   <input component-key="input-demo-webapp" />
+
+   <filter component-key="filter-language-detection-en" />
+   <filter component-key="filter-tokenizer" />
+   <filter component-key="filter-case-normalizer" />
+   <filter component-key="filter-stc" />
+
+   <output component-key="output-demo-webapp" />
+ </local-process>
+ }}}
+ * Now edit the above file and change the `input` component key to `input-nutch` and the `output` component key to `output-array`, leaving everything else exactly as it was.
+ {{{
+ <local-process id="stc-en">
+   <name>STC (+English)</name>
+   <description>Suffix Tree Clustering Algorithm</description>
+
+   <input component-key="input-nutch" />
+
+   <filter component-key="filter-language-detection-en" />
+   <filter component-key="filter-tokenizer" />
+   <filter component-key="filter-case-normalizer" />
+   <filter component-key="filter-stc" />
+
+   <output component-key="output-array" />
+ </local-process>
+ }}}
+ * The filters you see in the process descriptor should also be available. Some of them are built into the Carrot2 core; the others should be copied from the DCS distribution to the same temporary folder we copied the process definition to. In our case the following filter definition files should be copied: `filter-language-detection-en.bsh`, `filter-tokenizer.bsh`, `filter-case-normalizer.bsh` and `filter-stc.bsh`.
+ * Process and component descriptors are read as a resource (relative to the classpath). Jetty can be configured to have an additional classpath entry as a folder, but it slightly complicates things (the hierarchy of classloaders may result in some hard-to-track errors). It will be easier to just place all the required files in Nutch's Web application context under `WEB-INF`. If you work with the WAR file directly, you'll need to add the resources mentioned below to the WAR file (it's a ZIP file, so that's not a problem). 
+ * Copy process and component descriptor files to `{NUTCH-CONTEXT}/WEB-INF/classes/`. +
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin -- * The filters you see in the process descriptor should also be available. Some of them are built into the Carrot2 core; the others should be copied from the DCS distribution to the same temporary folder we copied the process definition to. In our case the following filter definition files should be copied: `filter-language-detection-en.bsh`, `filter-tokenizer.bsh`, `filter-case-normalizer.bsh` and `filter-stc.bsh`. * Process and component descriptors are read as a resource (relative to the classpath). Jetty can be configured to have an additional classpath entry as a folder, but it slightly complicates things (the hierarchy of classloaders may result in some hard-to-track errors). It will be easier to just place all the required files in Nutch's Web application context under `WEB-INF`. If you work with the WAR file directly, you'll need to add the resources mentioned below to the WAR file (it's a ZIP file, so that's not a problem). * Copy process and component descriptor files to `{NUTCH-CONTEXT}/WEB-INF/classes/`.
+ * Copy all JAR files from the DCS (`WEB-INF/lib/*.jar`) to `{NUTCH-CONTEXT}/WEB-INF/lib`. Overwrite older libraries whenever prompted.
+ * Finally, the path to the clustering process should be added to `{NUTCH-CONTEXT}/WEB-INF/classes/nutch-site.xml`:
+ {{{
+ <property>
+   <name>extension.clustering.carrot2.process-resource</name>
+   <value>/alg-stc-en.xml</value>
+ </property>
+ }}}
+ * Restart your Web application container. The clustering plugin should use the STC clustering algorithm if everything went ok.
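The copy steps above can be sketched as a small shell script. This is only a sketch: the `mktemp` directories below simulate an unpacked DCS distribution and a deployed Nutch webapp context, and the single `carrot2-core.jar` stand-in represents the full set of DCS JARs; in a real install, point the two variables at the actual locations instead.

```shell
#!/bin/sh
# Illustrative stand-ins: mktemp dirs simulate an unpacked DCS distribution
# and a deployed Nutch webapp context. In a real install, set these two
# variables to the actual locations instead.
DCS_HOME=$(mktemp -d)
NUTCH_CONTEXT=$(mktemp -d)
mkdir -p "$DCS_HOME/descriptors" "$DCS_HOME/WEB-INF/lib" \
         "$NUTCH_CONTEXT/WEB-INF/classes" "$NUTCH_CONTEXT/WEB-INF/lib"

# Empty files stand in for the real descriptors and JARs from the DCS.
for f in alg-stc-en.xml filter-language-detection-en.bsh filter-tokenizer.bsh \
         filter-case-normalizer.bsh filter-stc.bsh; do
  : > "$DCS_HOME/descriptors/$f"
done
: > "$DCS_HOME/WEB-INF/lib/carrot2-core.jar"

# Step 1: process and component descriptors go onto the webapp classpath.
cp "$DCS_HOME"/descriptors/* "$NUTCH_CONTEXT/WEB-INF/classes/"
# Step 2: Carrot2 JARs replace the older copies shipped with the plugin.
cp "$DCS_HOME"/WEB-INF/lib/*.jar "$NUTCH_CONTEXT/WEB-INF/lib/"
```

After this, the `extension.clustering.carrot2.process-resource` property is added to `nutch-site.xml` and the container restarted, as the entry above describes.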
[Nutch Wiki] Update of Support by RidaBenjelloun
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by RidaBenjelloun: http://wiki.apache.org/nutch/Support -- * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com * CNLP http://www.cnlp.org/tech/lucene.asp - * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop etc.) info at doculibre.com + * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) info at doculibre.com * [http://www.dsen.nl DSEN - Java | J2EE | Agile Development Consultancy] * eventax GmbH info at eventax.com * [http://www.foofactory.fi/ FooFactory] / Sami Siren info at foofactory dot fi
[Nutch Wiki] Update of WritingPluginExample-0.9 by BUlicny
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by BUlicny: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 The comment on the change is: Updated format of QueryFilter extension in plugin.xml --
       name="Recommended Search Query Filter"
       point="org.apache.nutch.searcher.QueryFilter">
     <implementation id="RecommendedQueryFilter"
-        class="org.apache.nutch.parse.recommended.RecommendedQueryFilter"
+        class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
-        fields="DEFAULT"/>
+        <parameter name="fields" value="recommended"/>
+    </implementation>
   </extension>
 </plugin>
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Updated the info about clustering plugin and instructions. -- - -- Main.DawidWeiss - 01 Dec 2004 + = Clustering Plugin = - * plugin name: Online Search Results Clustering using Carrot2's Lingo component + plugin name:: Online Search Results Clustering using Carrot2 components - * plugin version: 0.9.0 + plugin version:: 1.0.3 + == Plugin Info == - * provider: Dawid Weiss, The Carrot2 project - * plugin home url: Included in Nutch CVS. Home WWW of the project: http://carrot2.sourceforge.net - * plugin download url: A binary is included in Nutch CVS. The plugin builds together with Nutch. - * license: BSD-style - * short description: Search results clustering plugin. + * provider: The Carrot2 project, [http://www.carrot2.org] + * plugin home url: Plugin is included in Nutch codebase. + * plugin download url: Binaries included with Nutch. + * license: BSD-style + * short description: Plugin for clustering search results at query-time. - * long description: A plugin that clusters search results into groups of (related, hopefully) documents. + * long description: This plugin organizes search results into groups of (related, hopefully) documents. - * configureable parameters: Take a look at the defaults defined in nutch-default.xml (search for 'clustering'). + * configureable parameters: Take a look at the defaults defined in nutch-default.xml (search for 'clustering'). - * meta data added to index: None. Clustering is performed dynamically for each result set. + * meta data added to index: None. Clustering is performed dynamically for each result set. + * required jars: The entire `lib` folder in the plugin must be present in classpath. More JARs might be needed from the Carrot2 project if additional algorithms or languages are to be used. 
- * required jars: Many - the entire lib folder in the plugin must be present in classpath. - * plugin extension points: - - * plugin extension point interface: net.nutch.clustering.OnlineClusterer + * plugin extension point interface: net.nutch.clustering.OnlineClusterer - * plugin extension point xml snippet: ? - = Installation guide + == Installation guide == - * Create some index using the instructions provided in Nutch documentation, + * Create a search index using the instructions provided in Nutch documentation. - * Deploy Nutch Web application and make sure the index is found and works (type a query and see if you + * Deploy Nutch Web application and make sure the index is found and searching works (type a query and see if you get any results). - get any results). + * Stop the web server (Tomcat, Jetty or anything you like). + * Modify `WEB-INF/classes/nutch-default.xml` file and include the clustering plugin (it is by default ignored) by adding `clustering-carrot2` to `plugin.includes` property. + * Restart your web server and reload the search page. You should see the `clustering` checkbox next to `search` button. Enable it and rerun your query. Cluster labels and documents should appear to the right of search results. - * Stop Web container (Tomcat) - * You must modify =WEB-INF/classes/nutch-default.xml= file and include the clustering plugin (it is by default - ignored). - - plugin.includes - - protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|clustering-carrot2 - Regular expression naming plugin directory names to - - include. Any plugin not matching this expression is excluded. By - default Nutch includes crawling just HTML and plain text via HTTP, - and basic indexing and search plugins. - - * Restart Tomcat. - - * Reload the search page of Nutch. You should see the =clustering= checkbox next to =search= button. - Enable it and rerun your query. Clustered results should appear to the right. -
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Configuration help added. -- * required jars: The entire `lib` folder in the plugin must be present in classpath. More JARs might be needed from the Carrot2 project if additional algorithms or languages are to be used. * plugin extension point interface: net.nutch.clustering.OnlineClusterer + * '''Carrot2 JARs come from codebase in version: 2.1''' == Installation guide == @@ -27, +28 @@ * Modify `WEB-INF/classes/nutch-default.xml` file and include the clustering plugin (it is by default ignored) by adding `clustering-carrot2` to the `plugin.includes` property. * Restart your web server and reload the search page. You should see the `clustering` checkbox next to the `search` button. Enable it and rerun your query. Cluster labels and documents should appear to the right of search results. + Note that the user interface in the default Nutch Web application is very limited and you'll most likely need something more application-specific. Look at [http://www.carrot2.org] or [http://www.carrot-search.com] for inspiration.
+
+ == Configuration guide ==
+
+ Libraries in this release are precompiled with stemming and stop words for the various languages present in the Carrot2 codebase. You should define the default language and supported languages in the Nutch configuration file (nutch-site.xml). If nothing is given in the Nutch configuration, English is used by default. The following properties can be added to `nutch-site.xml`:
+
+ {{{
+ <!-- Carrot2 Clustering plugin configuration -->
+
+ <property>
+   <name>extension.clustering.carrot2.defaultLanguage</name>
+   <value>en</value>
+   <description>Two-letter ISO code of the language.
+   http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt</description>
+ </property>
+
+ <property>
+   <name>extension.clustering.carrot2.languages</name>
+   <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv</value>
+   <description>All languages to be used by the clustering plugin.
+   This list includes all currently supported languages (although not all of them
+   will successfully instantiate -- support for Polish requires additional
+   libraries, for instance). Adjust to your needs; fewer languages take less
+   memory.
+
+   If you use the language recognizer plugin, then each hit will come with its
+   own ISO language code. All hits with no explicit language take the default
+   language specified in the extension.clustering.carrot2.defaultLanguage property.
+   </description>
+ </property>
+ }}}
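A quick way to double-check which languages a given configuration actually lists is to pull the `<value>` out of the languages property with `sed`. This is only a sketch: the config fragment is generated inline as a stand-in for a real nutch-site.xml, whose path would normally be passed in instead.

```shell
#!/bin/sh
# Extract the clustering language list from a nutch-site.xml-style file.
# The file is generated inline here as a stand-in for a real config.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>extension.clustering.carrot2.languages</name>
  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv</value>
</property>
EOF

# Find the <name> line for the languages property, read the next line,
# and keep only the text between <value> and </value>.
langs=$(sed -n '/extension.clustering.carrot2.languages/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$conf")
echo "$langs"
```

Note this assumes the `<value>` element sits on the line directly after `<name>`, which holds for the property blocks shown above.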
[Nutch Wiki] Update of ClusteringPlugin by DawidWeiss
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by DawidWeiss: http://wiki.apache.org/nutch/ClusteringPlugin The comment on the change is: Additional languages for the supported language list. --
 <property>
   <name>extension.clustering.carrot2.languages</name>
-  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv</value>
+  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv,tr,ro,hu</value>
   <description>All languages to be used by the clustering plugin.
   This list includes all currently supported languages (although not all of them
   will successfully instantiate -- support for Polish requires additional
[Nutch Wiki] Update of PublicServers by rlhoad
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by rlhoad: http://wiki.apache.org/nutch/PublicServers -- = Public search engines using Nutch = Please sort by name alphabetically - - * [http://www.arancia.com Arancia Outlaw] Italian search engine for legal material, laws and high court sentences. * [http://askaboutoil.com AskAboutOil] is a vertical search portal for the petroleum industry.
[Nutch Wiki] Update of FrontPage by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/FrontPage The comment on the change is: crawl script -- * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse for v0.8 * [RunNutchInEclipse0.9] for v0.9 + * Crawl - script to crawl (and possibly recrawl too) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl The comment on the change is: crawl script New page: == Introduction == This is a script to crawl an intranet or the web. It does not crawl using the 'bin/crawl' tool or the 'Crawl' class present in Nutch; therefore the filters present in 'conf/crawl-urlfilter.txt' have no effect on this script. The filters for this script must be set in 'regex-urlfilter.txt'. == Steps == The complete job of this script has been divided broadly into 8 steps. # Inject URLs # Generate, Fetch, Parse, Update Loop # Merge Segments # Invert Links # Index # Dedup # Merge Indexes # Reload index == Modes of Execution == The script can be executed in two modes: * Normal Mode * Safe Mode === Normal Mode === If the script is executed with the command 'bin/runbot', it will delete all the directories such as fetched segments, generated indexes, etc., so as to save space. It will also reload the index after it finishes crawling, and the new crawl DB will go live. '''Caution:''' This also means that if something has gone wrong during the crawl and the resultant crawl DB is corrupt or incomplete, it might not return results for any query. And since this crawl DB goes live in 'normal mode', your visitors may not see any results. === Safe Mode === Alternatively, the script can be executed in safe mode as 'bin/runbot safe', which will prevent deletion of these directories. If errors occur, you can take recovery action because the directories haven't been deleted. You can then manually merge the segments, generate indexes, etc. from the directories and make the resultant crawl DB go live. Safe mode also suppresses the automatic reloading of the new index. Therefore, the resultant crawl DB does not go live immediately after crawling. This gives you a chance to first test the new crawl DB for valid results. 
If it is found to work, you can make this new DB go live. === Normal Mode vs. Safe Mode === Ideally, you should run the script in safe mode a couple of times to make sure the crawl is running fine. If you are sure that everything will go fine, you need not run it in safe mode. == Tinkering == Adjust the variables 'depth', 'threads', 'adddays' and 'topN' as per your needs. Delete or comment out the statement for the 'topN' assignment if you do not wish to set a 'topN' value. === NUTCH_HOME === If you are not executing the script as 'bin/runbot' from the Nutch directory, you should either set the environment variable 'NUTCH_HOME' or edit the following in the script:
{{{
if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
}}}
Set 'NUTCH_HOME' to the path of the Nutch directory (if you are not setting it as an environment variable; if the environment variable is set, the above assignment is ignored). === CATALINA_HOME === 'CATALINA_HOME' points to the Tomcat installation directory. You must either set this as an environment variable or set it by editing the following lines in the script:
{{{
if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
}}}
As in the previous section, if this variable is set in the environment, then the above assignment is ignored. == Can it re-crawl? == The author has used this script to re-crawl a couple of times. However, no real-world testing has been done for re-crawling. Therefore, you may try to use the script for re-crawl. Whether or not it works properly for re-crawl, please let us know. == Script == {{{
#!/bin/sh

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails. 
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo "runbot: $0 could not find environment variable CATALINA_HOME"
  echo "runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "- Inject (Step 1 of $steps) -"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb
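The Generate/Fetch/Parse/Update loop the script performs in step 2 can be reduced to the sketch below. This is not the script itself: `NUTCH` is deliberately an `echo` stand-in so the command sequence can be inspected without a Nutch install, and the segment path is a placeholder for the usual "newest directory under crawl/segments" lookup.

```shell
#!/bin/sh
# Dry-run sketch of the Generate/Fetch/Parse/Update loop (step 2 of 8).
# NUTCH is an echo stand-in; replace with $NUTCH_HOME/bin/nutch for real runs.
NUTCH="echo bin/nutch"
depth=2
threads=50
topN=2

i=1
while [ $i -le $depth ]; do
  echo "--- Beginning crawl at depth $i of $depth ---"
  $NUTCH generate crawl/crawldb crawl/segments -topN $topN
  segment="crawl/segments/$i"   # placeholder: normally the newest dir under crawl/segments
  $NUTCH fetch "$segment" -threads $threads
  $NUTCH updatedb crawl/crawldb "$segment"
  i=`expr $i + 1`
done
```

With `fetcher.parse` left at its default, parsing happens during the fetch step, which is why no separate parse command appears in the loop.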
[Nutch Wiki] Update of FrontPage by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/FrontPage -- * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse for v0.8 * [RunNutchInEclipse0.9] for v0.9 - * Crawl - script to crawl (and possible recrawl too) + * [Crawl] - script to crawl (and possible recrawl too) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Trivial Update of Crawl by susam
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by susam: http://wiki.apache.org/nutch/Crawl -- == Steps == The complete job of this script has been divided broadly into 8 steps. - # Inject URLs + 1. Inject URLs - # Generate, Fetch, Parse, Update Loop + 2. Generate, Fetch, Parse, Update Loop - # Merge Segments + 3. Merge Segments - # Invert Links + 4. Invert Links - # Index + 5. Index - # Dedup + 6. Dedup - # Merge Indexes + 7. Merge Indexes - # Reload index + 8. Reload index == Modes of Execution == The script can be executed in two modes:-
[Nutch Wiki] Update of IntranetRecrawl by JamesVictor
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/IntranetRecrawl The comment on the change is: clarified script location, removed erroneous info on 0.9.0 changes -- }}} == Version 0.8.0 and 0.9.0 == + - Place in the bin sub-directory within Nutch and run. + Place in the `bin` sub-directory within your Nutch install and run. - ** MUST CALL SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK*** + '''CALL THE SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK''' === Example Usage === `./usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31` - (with adddays being '31', all pages will be recrawled) + Setting adddays to `31` causes all pages to be recrawled. === Changes for 0.9.0 === + No changes necessary for this to run with Nutch 0.9.0. - Change line 76 to read - {{{ - #Sets the path to bin - nutch_dir=`dirname $0`/bin - }}} - - in order for the proper path to be built. Everything else may remain the same. === Code ===
[Nutch Wiki] Update of IntranetRecrawl by JamesVictor
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/IntranetRecrawl The comment on the change is: added instructions for Nutch 0.9.0 script -- }}} - == Version 0.8.0 == + == Version 0.8.0 and 0.9.0 == Place in the bin sub-directory within Nutch and run. ** MUST CALL SCRIPT USING THE FULL PATH TO THE SCRIPT OR IT WON'T WORK*** + === Example Usage === - ./usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31 + `./usr/local/nutch/bin/recrawl /usr/local/tomcat/webapps/ROOT /usr/local/nutch/crawl 10 31` (with adddays being '31', all pages will be recrawled) + + === Changes for 0.9.0 === + + Change line 76 to read + {{{ + #Sets the path to bin + nutch_dir=`dirname $0`/bin + }}} + + in order for the proper path to be built. Everything else may remain the same. === Code ===
[Nutch Wiki] Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * [:08CommandLineOptions:Commandline] options for version 0.8 * OverviewDeploymentConfigs * NutchConfigurationFiles - * GettingNutchRunningWithUtf8 - For support of non-ASCII characters (Chinese, Japanese and Korean). + * GettingNutchRunningWithUtf8 - For support of non-ASCII characters (Chinese, German, Japanese, Korean). * GettingNutchRunningWithResin - Resin is a JSP/Servlet/EJB application server (alternative to tomcat). * GettingNutchRunningWithJetty * GettingNutchRunningWithUbuntu
[Nutch Wiki] Trivial Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * CreateNewFilter - for example to add a category metadata to your index and be able to search for it * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] - * RunNutchInEclipse + * RunNutchInEclipse for v0.8 - * [RunNutchInEclipse0.9] (update - work in progress for 0.9) + * [RunNutchInEclipse0.9] for v0.9 * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of RunNutchInEclipse0.9 by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 -- * change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml * make sure Nutch is configured correctly before testing it in Eclipse ;-) + === missing org.farng and com.etranslate === + You will encounter problems with some imports in the parse-mp3 and parse-rtf plugins (30 errors in my case). + Because of incompatibility with the Apache license they were left out of the sources. + You can download them here: + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ + + Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. + Then add them as libraries to the build path (first refresh the workspace, then right-click the source + folder -> Java Build Path -> Libraries -> Add JARs). + + === Build Nutch === * If you set up the project correctly, Eclipse will build Nutch for you into tmp_build. - --- okay up to here... going to do rest tomorrow... @@ -62, +75 @@ * click on Run * if all works, you should see Nutch getting busy at crawling :-) - == Debug Nutch in Eclipse == + == Debug Nutch in Eclipse (not yet tested for 0.9) == * Set breakpoints and debug a crawl * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints: {{{ @@ -78, +91 @@ Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-) === eclipse: Cannot create project content in workspace === - The nutch source code must be out of the workspace folder. My first attemp was download the code with eclipse (svn) under my workspace. When I try to create the project using existing code, eclipse don't let me do it from source code into the workspace. 
I use the source code out of my workspace and it work fine. + The nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) under my workspace. When I tried to create the project using existing code, Eclipse didn't let me do it from source code inside the workspace. I used the source code outside my workspace and it worked fine. === plugin dir not found === - Make sure you set your plugin.folders property correct, instead of using a relative path you can use a absoluth one as well in nutch-defaults.xml or may be better in nutch-site.xml + Make sure you set your plugin.folders property correctly; instead of a relative path you can use an absolute one as well, in nutch-default.xml or, maybe better, in nutch-site.xml
{{{
<property>
  <name>plugin.folders</name>
-  <value>/home/../nutch-0.8/src/plugin</value>
+  <value>/home/../nutch-0.9/src/plugin</value>
}}}
@@ -107, +120 @@ * open the class itself, rightclick * refresh the build dir - === missing org.farng and com.etranslate === - You may have problems with some imports in parse-mp3 and parse-rtf plugins. Because of incompatibility with apache licence they were left from sources. You can find it here: - - http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ - - http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ - - You need to copy jar files into plugin lib path and refresh the project. - === debugging hadoop classes === Sometimes it makes sense to also have the hadoop classes available during debugging. So, you can check out the Hadoop sources on your machine and add the sources to the hadoop-xxx.jar. Alternatively, you can:
[Nutch Wiki] Update of WritingPluginExample-0.9 by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 -- Edit this block to add a line for your plugin before the `</target>` tag. {{{ - ant dir=reccomended target=deploy / + ant dir=recommended target=deploy / }}} Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.
[Nutch Wiki] Trivial Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse - * RunNutchInEclipse0.9 (update - work in progress for 0.9) + * RunNutchInEclipse0_9 (update - work in progress for 0.9) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse - * RunNutchInEclipse0_9 (update - work in progress for 0.9) + * [RunNutchInEclipse0.9] (update - work in progress for 0.9) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of FrontPage by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FrontPage -- * UpgradeFrom07To08 * [Upgrading_from_0.8.x_to_0.9] * RunNutchInEclipse + * RunNutchInEclipse0.9 (update - work in progress for 0.9) * IntranetRecrawl - script to recrawl a crawl * MergeCrawl - script to merge 2 (or more) crawls * SearchOverMultipleIndexes - configuring nutch to enable searching over multiple indexes
[Nutch Wiki] Update of RunNutchInEclipse0.9 by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9 New page: = RunNutchInEclipse = This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page :-) == Tested with == * Nutch release 0.9 * Eclipse 3.3 - aka Europa * Java 1.6 * Ubuntu (should work on most platforms, though) == Before you start == Setting up Nutch to run in Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)... == Steps == === Install Nutch === * Grab a fresh release of Nutch 0.9 * Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory === Create a new java project in Eclipse === * File > New > Project > Java project > click Next * select Create project from existing source and use the location where you downloaded Nutch * click on Next, and wait while Eclipse is scanning the folders * add the folder conf to the classpath (third tab and then add class folder) * Eclipse should have guessed all the java files that must be added on your classpath. If that's not the case, add src/java, src/test and all plugin src/java and src/test folders to your source folders. 
 * Also add all jars in lib and in the plugin lib folders to your libraries
 * Set the output dir to tmp_build; create it if necessary
 * DO NOT add build to the classpath

=== Configure Nutch ===
 * See the [http://wiki.apache.org/nutch/NutchTutorial Tutorial]
 * Change the property plugin.folders to ./src/plugin in $NUTCH_HOME/conf/nutch-default.xml
 * Make sure Nutch is configured correctly before testing it in Eclipse ;-)

=== Build Nutch ===
 * If you set up the project correctly, Eclipse will build Nutch for you into tmp_build.

--- okay up to here... going to do rest tomorrow...

=== Create Eclipse launcher ===
 * Menu Run > Run...
 * Create a New launcher for Java Application
 * Set Main class to: {{{ org.apache.nutch.crawl.Crawl }}}
 * On the Arguments tab, set Program Arguments to: {{{ urls -dir crawl -depth 3 -topN 50 }}}
 * and VM arguments to: {{{ -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log }}}
 * Click Run
 * If all works, you should see Nutch getting busy at crawling :-)

== Debug Nutch in Eclipse ==
 * Set breakpoints and debug a crawl
 * It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
{{{
Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks
}}}

== If things do not work... ==
Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)

=== eclipse: Cannot create project content in workspace ===
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) under my workspace. When I tried to create the project using the existing code, Eclipse didn't let me do it from source code inside the workspace. I used the source code outside of my workspace and it worked fine.
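For reference, the Eclipse launcher described above corresponds roughly to the following command line (a sketch added here, not from the original wiki page; it assumes the tmp_build output dir and the lib jars set up in the steps above - adjust paths to your checkout). The `echo` is kept so the sketch is safe to paste; drop it to actually run the crawl.

```shell
# Sketch: the same crawl the Eclipse launcher runs, from $NUTCH_HOME.
# Assumes classes compiled into tmp_build and jars under lib/.
CLASSPATH="conf:tmp_build"
for jar in lib/*.jar; do CLASSPATH="$CLASSPATH:$jar"; done   # append every lib jar

# Same main class, program arguments and VM arguments as the launcher:
echo java -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log \
     -cp "$CLASSPATH" \
     org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 50
```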
=== plugin dir not found ===
Make sure you set your plugin.folders property correctly. Instead of a relative path you can use an absolute one as well, in nutch-default.xml or, maybe better, in nutch-site.xml:
{{{
<property>
  <name>plugin.folders</name>
  <value>/home/../nutch-0.8/src/plugin</value>
</property>
}}}

=== No plugins loaded during unit tests in Eclipse ===
During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

=== Unit tests work in eclipse but fail when running ant in the command line ===
Suppose your unit tests work perfectly in Eclipse, but each and every one fails when running '''ant test''' on the command line - including the ones you haven't modified. Check whether you defined the '''plugin.folders''' property in hadoop-site.xml. In that case, try removing it from that file and adding it directly to nutch-site.xml. Run '''ant test''' again. That should have solved the problem. If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin/build.xml, in the test target?

=== classNotFound ===
 * Open the class itself, right-click
 * Refresh the build dir

=== missing org.farng and com.etranslate ===
You may have problems with some imports in the parse-mp3 and parse-rtf plugins. Because of incompatibility with the Apache licence they were left out of the sources. You can find them here:
[Nutch Wiki] Update of FAQ by KaiMiddleton
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by KaiMiddleton: http://wiki.apache.org/nutch/FAQ

--
 Assuming your index is located at /index:
 {{{% cd /index/
 % $CATALINA_HOME/bin/startup.sh}}}
- '''Now you can search.''
+ '''Now you can search.'''
 2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then you need to edit the nutch-site.xml and put in it the location of the index folder.
 {{{% $CATALINA_HOME/bin/startup.sh
@@ -391, +391 @@
 </property>
 }}}
 After that, __don't forget to crawl again__ and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields primaryType, subType and contentLength) as you normally do for the title and URL of the hits.
- (Note by DanielLopez) Thanks to Doğacan Güney for the tip.
+ (Note by DanielLopez) Thanks to Dogacan Güney for the tip.

 === Crawling ===
@@ -399, +399 @@
 The crawl tool expects as its first parameter the folder name where the seeding urls file is located, so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...
- Some pages are not indexed but my regex file and everything else is okay - what is going on?
+ Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?
 The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1 (unlimited).
+ To overcome this limitation change the '''db.max.outlinks.per.page''' property to a higher value or simply -1 (unlimited).
 file: conf/nutch-default.xml
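The nutch-site.xml edit mentioned in step 2 above records where the index lives. As a sketch (an addition here, not part of the FAQ diff; searcher.dir is the standard Nutch 0.x property for this, and /index is the example path used above):

```xml
<!-- Sketch: nutch-site.xml of the deployed Nutch webapp.
     searcher.dir points the webapp at the crawl/index directory;
     /index is the example location from this FAQ entry. -->
<property>
  <name>searcher.dir</name>
  <value>/index</value>
</property>
```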
[Nutch Wiki] Update of FAQ by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FAQ

--
 The crawl tool expects as its first parameter the folder name where the seeding urls file is located so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...
+ Some pages are not indexed but my regex file and everyhing else is okay - what is going on?
+ The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
+ To overcome this limitation change the property to a higher value or simply -1.
+
+ file: conf/nutch-default.xml
+
+ <property>
+ <name>db.max.outlinks.per.page</name>
+ <value>-1</value>
+ <description>The maximum number of outlinks that we'll process for a page.
+ If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
+ will be processed for a page; otherwise, all outlinks will be processed.
+ </description>
+ </property>
+
+ see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html

 === Discussion ===
 [http://grub.org/ Grub] has some interesting ideas about building a search engine using distributed computing. ''And how is that relevant to nutch?''
[Nutch Wiki] Update of FAQ by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FAQ

--
 The crawl tool expects as its first parameter the folder name where the seeding urls file is located so for example if your urls.txt is located in /nutch/seeds the crawl command would look like: crawl seed -dir /user/nutchuser...
- Some pages are not indexed but my regex file and everyhing else is okay - what is going on?
+ Some pages are not indexed but my regex file and everything else is okay - what is going on?
 The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1.
+ To overcome this limitation change the property to a higher value or simply -1 (unlimited).
 file: conf/nutch-default.xml
+ {{{
 <property>
 <name>db.max.outlinks.per.page</name>
@@ -415, +416 @@
 </property>
 }}}
 see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html
-
+ (tested under nutch 0.9)
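As an aside (an addition here, not part of the FAQ diffs above): rather than editing conf/nutch-default.xml in place, the usual Nutch convention is to put the override in conf/nutch-site.xml, whose values take precedence over the defaults:

```xml
<!-- Sketch: override in conf/nutch-site.xml instead of editing
     nutch-default.xml; site settings win over the shipped defaults. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```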
[Nutch Wiki] Update of FAQ by ra
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ra: http://wiki.apache.org/nutch/FAQ

--
 To overcome this limitation change the property to a higher value or simply -1.
 file: conf/nutch-default.xml
-
+ {{{
 <property>
 <name>db.max.outlinks.per.page</name>
 <value>-1</value>
@@ -413, +413 @@
 will be processed for a page; otherwise, all outlinks will be processed.
 </description>
 </property>
-
+ }}}
 see also: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08665.html
[Nutch Wiki] Update of Support by ThomasDelnoij
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ThomasDelnoij: http://wiki.apache.org/nutch/Support

--
  * [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
  * CNLP http://www.cnlp.org/tech/lucene.asp
  * [http://www.doculibre.com/ Doculibre Inc.] Open source and information management consulting. (Lucene, Nutch, Hadoop etc.) info at doculibre.com
+ * [http://www.dsen.nl DSEN - Java | J2EE | Agile Development Consultancy]
  * eventax GmbH info at eventax.com
  * [http://www.foofactory.fi/ FooFactory] / Sami Siren info at foofactory dot fi
  * [http://www.lucene-consulting.com/ Lucene Consulting] / Otis Gospodnetic otis at apache.org
[Nutch Wiki] Update of GettingNutchRunningWithWindows by JamesVictor
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by JamesVictor: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows The comment on the change is: removed index-more from example; it threw an exception on my indexing

--
 Edit `conf/nutch-site.xml` and change the value of `plugin.includes` to include the plugins for the document types that you want Nutch to handle.
- For example, to add parsing for PDF, MS Office, and OpenOffice documents, and use the `index-more` instead of `index-basic`, you'll have something like:
+ Example: to add parsing for PDF, MS Office, and OpenOffice documents, you'll have something like:
 {{{
 <property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|
- index-more|query-(basic|site|url)|summary-basic|scoring-opic|
+ index-basic|query-(basic|site|url)|summary-basic|scoring-opic|
 urlnormalizer-(pass|regex|basic)</value>
 </property>
 }}}