[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius

2010-05-10 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunningNutchAndSolr page has been changed by Dmitrius.
The comment on this change is: Fixed command (single quotes missed).
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29

--

  = New in Nutch 1.0-dev =
- Please note that in the nightly version of Apache Nutch there is now a Solr 
integration embedded so you can start to use a lot easier. Just download a 
nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that the nightly version of Apache Nutch now has a Solr 
integration embedded, so you can get started much more easily. Just download a 
nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
  
  = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to [[http://variogram.com||Brian Whitman at 
Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for 
all the help!  You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to http://variogram.com and http://blog.foofactory.fi 
for all the help!  You guys saved me a lot of time! :)
  
  I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
  
@@ -12, +12 @@

   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
- 
  The first step is to download the required software components, namely 
Apache Solr and Nutch.
  
  '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from the Download page
@@ -23, +22 @@

  
  '''4.''' Extract the Nutch package   tar xzf apache-nutch-1.0.tar.gz
  
+ '''5.''' Configure Solr. For the sake of simplicity we are going to use the 
example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
  
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf 
(override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf 
to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
  
  We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
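For illustration, the kind of schema.xml line this refers to might look like the following sketch; only stored="true" is the point here, while the field type and other attribute values are assumptions:

```xml
<!-- stored="true" keeps the raw content available for snippet highlighting;
     indexed="true" keeps the field searchable -->
<field name="content" type="text" stored="true" indexed="true"/>
```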
  
@@ -52, +48 @@

  
  <str name="qf">
  
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
  
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
- </str>
  
+ <str name="fl">url</str>
- <str name="fl">
- url
- </str>
  
+ <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
- <str name="mm">
- 2&lt;-1 5&lt;-2 6&lt;90%
- </str>
  
  <int name="ps">100</int>
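Reassembled with the stripped markup restored, the dismax parameters above would appear inside the request handler section of solrconfig.xml roughly as follows (a sketch; the enclosing requestHandler element and its other defaults are assumed):

```xml
<!-- query fields with per-field boosts: title matches weigh most -->
<str name="qf">content^0.5 anchor^1.0 title^1.2</str>
<!-- phrase fields used for proximity boosting -->
<str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
<!-- fields returned in results -->
<str name="fl">url</str>
<!-- minimum-should-match: for 3-5 query terms, all but one must match -->
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<!-- phrase slop -->
<int name="ps">100</int>
```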
  
@@ -91, +80 @@

  
  '''6.''' Start Solr
  
+ cd apache-solr-1.3.0/example; java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar
  
  '''7. Configure Nutch'''
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its 
contents with the following (we specify our crawler name and active plugins, and 
limit the maximum URL count per host per run to 100):
  
+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
  
  <property>
  
@@ -109, +96 @@

  
  </property>
  
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
  
  <value>100</value>
  
@@ -126, +112 @@

  
  </configuration>
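Assembled from the fragments above, a minimal nutch-site.xml would look roughly like this. The crawler-name property is elided in the diff, so `http.agent.name` and its value here are illustrative assumptions:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- illustrative crawler name; substitute your own -->
    <value>my-nutch-crawler</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <!-- limit URLs per single host per run, as described in step 7a -->
    <value>100</value>
  </property>
</configuration>
```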
  
- 
  '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and 
replace its content with the following:
  
  -^(https|telnet|file|ftp|mailto):
+ 
-  
- # skip some suffixes
- 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-  
+ 
- # skip URLs 

[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius

2010-05-10 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by Dmitrius.
The comment on this change is: It's a problem to make the wiki display a grave 
accent. Managed to do that using HTML codes.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=29&rev2=30

--

  
  The above command will generate a new segment directory under crawl/segments 
that at this point contains files that store the url(s) to be fetched. In the 
following commands we need the latest segment dir as parameter so we’ll store 
it in an environment variable:
  
- export SEGMENT=crawl/segments/``ls -tr crawl/segments|tail -1``
+ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
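The backtick command substitution above picks the most recently modified segment directory. A self-contained sketch of the same idiom, using made-up segment names in Nutch's timestamp style:

```shell
# Create a fake crawl/segments layout with two segment dirs
# (names are hypothetical examples of Nutch's yyyyMMddHHmmss format)
mkdir -p crawl/segments/20100510120000 crawl/segments/20100510130000
touch -t 201005101200 crawl/segments/20100510120000   # older
touch -t 201005101300 crawl/segments/20100510130000   # newer
# `ls -tr` lists oldest first, so `tail -1` yields the newest segment
export SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
echo "$SEGMENT"   # -> crawl/segments/20100510130000
```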
  
  Now I launch the fetcher that actually goes to get the content:
  


[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius

2010-05-10 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by Dmitrius.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=30&rev2=31

--

  
  The above command will generate a new segment directory under crawl/segments 
that at this point contains files that store the url(s) to be fetched. In the 
following commands we need the latest segment dir as parameter so we’ll store 
it in an environment variable:
  
- export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ export SEGMENT=crawl/segments/&#96;ls -tr crawl/segments|tail -1&#96;
  
  Now I launch the fetcher that actually goes to get the content:
  


[Nutch Wiki] Update of FrontPage by JulienNioche

2010-04-07 Thread Apache Wiki

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=128&rev2=129

--

   * [[Mailing]] Lists
   * AcademicArticles that deal with Nutch
   * [[http://videolectures.net/iiia06_cutting_ense/|Experiences with the Nutch 
search engine]], author: Doug Cutting, Video Lecture
- 
  
  == Nutch Administration ==
   * DownloadingNutch
@@ -89, +88 @@

   * TikaPlugin - Comments on the Tika integration and differences with 
existing parse plugins
  
  == Nutch 2.0 ==
+  * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
-  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
+  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
   * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
   * NewScoringIndexingExample -- Two full fetch cycles of commands using new 
scoring and indexing systems.
  


[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche

2010-04-07 Thread Apache Wiki

The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap

--

New page:
= Nutch2Roadmap =

Here is a list of the features and architectural changes that will be 
implemented in Nutch 2.0.

  * Storage Abstraction
* initially with back end implementations for HBase and HDFS  
* extend it to other storages later e.g. MySQL etc...
  * Plugin cleanup: Tika only for parsing document formats
* keep only the HtmlParseFilters functionality (probably with a different API) 
so that we can post-process the DOM created by Tika from whatever original 
format.
  * Externalize functionalities to crawler-commons project 
[http://code.google.com/p/crawler-commons/] 
* robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix,droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.
  * Remove index / search and delegate to SOLR
* we may still keep a thin abstract layer to allow other indexing/search 
backends (ElasticSearch?), but the current mess of indexing/query filters and 
competing indexing frameworks (lucene, fields, solr) should go away. We should 
go directly from DOM to a NutchDocument, and stop there.
  * Various new functionalities 
* e.g. sitemap support, canonical tag, better handling of redirects, 
detecting duplicated sites, detection of spam cliques, tools to manage the 
webgraph, etc.


This document is meant to serve as a basis for discussion; feel free to 
contribute to it.


[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche

2010-04-07 Thread Apache Wiki

The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=1&rev2=2

--

* Storage Abstraction
  * initially with back end implementations for HBase and HDFS  
  * extend it to other storages later e.g. MySQL etc...
-   * Plugin cleanup : Tika only for parsing document formats
+   * Plugin cleanup : Tika only for parsing document formats (see 
http://wiki.apache.org/nutch/TikaPlugin)
  * keep only stuff HtmlParseFilters (probably with a different API) so 
that we can post-process the DOM created in Tika from whatever original format.
* Externalize functionalities to crawler-commons project 
[http://code.google.com/p/crawler-commons/] 
  * robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix,droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.


[Nutch Wiki] Update of FAQ by Ankit Dangi

2010-03-22 Thread Apache Wiki

The FAQ page has been changed by Ankit Dangi.
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=115&rev2=116

--

  TableOfContents
  
  == Nutch FAQ ==
- 
  === General ===
- 
   Are there any mailing lists available? 
- 
  There are user, developer, commits and agents lists, all available at 
http://lucene.apache.org/nutch/mailing_lists.html.
  
   How can I stop Nutch from crawling my site? 
- 
  Please visit our [[http://lucene.apache.org/nutch/bot.html|webmaster info 
page]]
  
   Will Nutch be a distributed, P2P-based search engine? 
- 
  We don't think it is presently possible to build a peer-to-peer search engine 
that is competitive with existing search engines. It would just be too slow. 
Returning results in less than a second is important: it lets people rapidly 
reformulate their queries so that they can more often find what they're looking 
for. In short, a fast search engine is a better search engine. I don't think 
many people would want to use a search engine that takes ten or more seconds to 
return results.
  
  That said, if someone wishes to start a sub-project of Nutch exploring 
distributed searching, we'd love to host it. We don't think these techniques 
are likely to solve the hard problems Nutch needs to solve, but we'd be happy 
to be proven wrong.
  
- 
   Will Nutch use a distributed crawler, like Grub? 
- 
  Distributed crawling can save download bandwidth, but, in the long run, the 
savings is not significant. A successful search engine requires more bandwidth 
to upload query result pages than its crawler needs to download pages, so 
making the crawler use less bandwidth does not reduce overall bandwidth 
requirements. The dominant expense of operating a large search engine is not 
crawling, but searching.
  
   Won't open source just make it easier for sites to manipulate rankings? 

- 
  Search engines work hard to construct ranking algorithms that are immune to 
manipulation. Search engine optimizers still manage to reverse-engineer the 
ranking algorithms used by search engines, and improve the ranking of their 
pages. For example, many sites use link farms to manipulate search engines' 
link-based ranking algorithms, and search engines retaliate by improving their 
link-based algorithms to neutralize the effect of link farms.
  
  With an open-source search engine, this will still happen, just out in the 
open. This is analogous to encryption and virus protection software. In the 
long term, making such algorithms open source makes them stronger, as more 
people can examine the source code to find flaws and suggest improvements. Thus 
we believe that an open source search engine has the potential to better resist 
manipulation of its rankings.
  
   What Java version is required to run Nutch? 
- 
  Nutch 0.7 will run with Java 1.4 and up.
  
   Exception: java.net.SocketException: Invalid argument or cannot assign 
requested address on Fedora Core 3 or 4 
- 
  It seems you have installed IPV6 on your machine.
  
  To solve this problem, add the following java param to the java instantiation 
in bin/nutch:
  
  JAVA_IPV4=-Djava.net.preferIPv4Stack=true
  
- # run it
- exec $JAVA $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath $CLASSPATH 
$CLASS $@
+ # run it
+ exec $JAVA $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath $CLASSPATH $CLASS $@
  
   I have two XML files, nutch-default.xml and nutch-site.xml, why? 
+ nutch-default.xml is the out-of-the-box configuration for Nutch. Most 
configuration can (and should, unless you know what you're doing) stay as it is. 
nutch-site.xml is where you make the changes that override the default 
settings. The same goes for the servlet container application.
- 
- nutch-default.xml is the out of the box configuration for nutch. Most 
configuration can (and should unless you know what your doing) stay as it is.
- nutch-site.xml is where you make the changes that override the default 
settings.
- The same goes to the servlet container application.
  
   My system does not find the segments folder. Why? Or: How do I tell the 
''Nutch Servlet'' where the index files are located? 
- 
  There are at least two choices to do that:
  
-   First you need to copy the .WAR file to the servlet container webapps 
folder.
+  . First you need to copy the .WAR file to the servlet container webapps 
folder.
+ 
  {{{
 % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
  }}}
- 
-   1) After building your first index, start Tomcat from the index folder.
+  . 1) After building your first index, start Tomcat from the index folder.
- Assuming your index is located at /index :
+   . Assuming your index is located at /index :
+ 
  {{{
  % cd /index/
  % $CATALINA_HOME/bin/startup.sh
  }}}
- '''Now you can search.'''
+  . '''Now 

[Nutch Wiki] Update of Support by Christopher Bader

2010-03-22 Thread Apache Wiki

The Support page has been changed by Christopher Bader.
http://wiki.apache.org/nutch/Support?action=diff&rev1=48&rev2=49

--

* [[http://www.dsen.nl|Thomas Delnoij (DSEN) - Java | J2EE | Agile 
Development & Consultancy]]
* eventax GmbH info at eventax.com
* [[http://www.foofactory.fi/|FooFactory]] / Sami Siren info at foofactory 
dot fi
+   * [[http://www.kratylos.com/|Kratylos Technologies]] - Consulting, 
development, and tech support for open source search and speech.
* [[http://www.lucene-consulting.com/|Lucene Consulting]] - Nutch, Solr, 
Lucene, Hadoop consulting and development.  Founded by Otis Gospodnetic, 
[[http://www.amazon.com/Lucene-Action-Otis-Gospodnetic/dp/1932394281|Lucene in 
Action]] co-author.
* Stefan Groschupf sg at media-style.com
* Michael Nebel mn at nebel.de (germany preferred)


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2010-03-15 Thread Apache Wiki

The HttpAuthenticationSchemes page has been changed by susam.
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=18&rev2=19

--

  === Important Points ===
   1. For the authscope tag, the 'host' and 'port' attributes should always be 
specified. The 'realm' and 'scheme' attributes may or may not be specified, 
depending on your needs. If you are tempted to omit the 'host' and 'port' 
attributes because you want the credentials to be used for any host and any 
port for that realm/scheme, please use the 'default' tag instead. That's what 
the 'default' tag is meant for.
   1. One authentication scope should not be defined twice as different 
authscope tags with different credentials tags. However, if this is done by 
mistake, the credentials of the last defined authscope tag will be used. This 
is because the XML parsing code reads the file from top to bottom and sets the 
credentials for each authentication scope; if the same authentication scope is 
encountered again, it is overwritten with the new credentials. However, one 
should not rely on this behavior, as it might change with further development.
-  1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This means there 
should not be multiple tags with same host, port, scheme=NTLM but different 
realms. If you are omitting the scheme attribute and the server requires NTLM 
authentication, then there should not be multiple tags with same host, port but 
different realms. This is discussed more in the next section.
+  1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This means there 
should not be multiple authscope tags with same host, port, scheme=NTLM but 
different realms. If you are omitting the scheme attribute and the server 
requires NTLM authentication, then there should not be multiple tags with same 
host, port but different realms. This is discussed more in the next section.
   1. If you are using NTLM scheme, you should also set the 'http.agent.host' 
property in conf/nutch-site.xml
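To illustrate the rules above, a single fully specified authentication scope might be written like the following sketch. The element and attribute names follow the page's terminology, and the host, port, realm, and credential values are made-up examples:

```xml
<auth-configuration>
  <credentials username="demoUser" password="demoPass">
    <!-- host and port are always given; for NTLM, realm carries the domain -->
    <authscope host="intranet.example.org" port="80"
               realm="CORPDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```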
  
  === A note on NTLM domains ===
  NTLM does not use the concept of realms. Therefore, multiple realms for a 
web-server can not be defined as different authentication scopes for the same 
web-server requiring NTLM authentication. There should be exactly one authscope 
tag for NTLM scheme authentication scope for a particular web-server. The 
authentication domain should be specified as the value of the 'realm' 
attribute. NTLM authentication also requires the name or IP address of the host 
on which the crawler is running. Thus, 'http.agent.host' should be set properly.
  
  == Underlying HttpClient Library ==
- 'protocol-httpclient' is based on 
[[http://jakarta.apache.org/httpcomponents/httpclient-3.x/|Jakarta Commons 
HttpClient]]. Some servers support multiple schemes for authenticating users. 
Given that only one scheme may be used at a time for authenticating, it must 
choose which scheme to use. To accompish this, it uses an order of preference 
to select the correct authentication scheme. By default this order is: NTLM, 
Digest, Basic. For more information on the behavior during authentication, you 
might want to read the 
[[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html|HttpClient
 Authentication Guide]].
+ 'protocol-httpclient' is based on 
[[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some 
servers support multiple schemes for authenticating users. Given that only one 
scheme may be used at a time for authenticating, it must choose which scheme to 
use. To accomplish this, it uses an order of preference to select the correct 
authentication scheme. By default this order is: NTLM, Digest, Basic. For more 
information on the behavior during authentication, you might want to read the 
[[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient 
Authentication Guide]].
  
  == Need Help? ==
  If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, Susam Pal, usually responds to mails related 
to authentication problems. The DEBUG logs may be required to troubleshoot the 
problem. You must enable the debug log for 'protocol-httpclient' before running 
the crawler. To enable debug log for 'protocol-httpclient', open 
'conf/log4j.properties' and add the following line:


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2010-03-15 Thread Apache Wiki

The HttpAuthenticationSchemes page has been changed by susam.
The comment on this change is: Added suggestion to enable debug for Jakarta 
Commons HttpClient.
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=19&rev2=20

--

  'protocol-httpclient' is based on 
[[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some 
servers support multiple schemes for authenticating users. Given that only one 
scheme may be used at a time for authenticating, it must choose which scheme to 
use. To accomplish this, it uses an order of preference to select the correct 
authentication scheme. By default this order is: NTLM, Digest, Basic. For more 
information on the behavior during authentication, you might want to read the 
[[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient 
Authentication Guide]].
  
  == Need Help? ==
- If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, Susam Pal, usually responds to mails related 
to authentication problems. The DEBUG logs may be required to troubleshoot the 
problem. You must enable the debug log for 'protocol-httpclient' before running 
the crawler. To enable debug log for 'protocol-httpclient', open 
'conf/log4j.properties' and add the following line:
+ If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, Susam Pal, usually responds to mails related 
to authentication problems. The DEBUG logs may be required to troubleshoot the 
problem. You must enable the debug log for 'protocol-httpclient' and Jakarta 
Commons !HttpClient before running the crawler. To enable debug log for 
'protocol-httpclient' and !HttpClient, open 'conf/log4j.properties' and add the 
following line:
  {{{
  log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
+ log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
  }}}
  
  It would be good to check the following things before asking for help.


[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam

2010-03-15 Thread Apache Wiki

The HttpAuthenticationSchemes page has been changed by susam.
The comment on this change is: Added a link to my website.
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=20&rev2=21

--

  'protocol-httpclient' is based on 
[[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some 
servers support multiple schemes for authenticating users. Given that only one 
scheme may be used at a time for authenticating, it must choose which scheme to 
use. To accomplish this, it uses an order of preference to select the correct 
authentication scheme. By default this order is: NTLM, Digest, Basic. For more 
information on the behavior during authentication, you might want to read the 
[[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient 
Authentication Guide]].
  
  == Need Help? ==
- If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, Susam Pal, usually responds to mails related 
to authentication problems. The DEBUG logs may be required to troubleshoot the 
problem. You must enable the debug log for 'protocol-httpclient' and Jakarta 
Commons !HttpClient before running the crawler. To enable debug log for 
'protocol-httpclient' and !HttpClient, open 'conf/log4j.properties' and add the 
following line:
+ If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually 
responds to mails related to authentication problems. The DEBUG logs may be 
required to troubleshoot the problem. You must enable the debug log for 
'protocol-httpclient' and Jakarta Commons !HttpClient before running the 
crawler. To enable debug log for 'protocol-httpclient' and !HttpClient, open 
'conf/log4j.properties' and add the following line:
  {{{
  log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
  log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout


[Nutch Wiki] Trivial Update of Crawl by susam

2010-03-15 Thread Apache Wiki

The Crawl page has been changed by susam.
The comment on this change is: Fixed wiki markup for codes.
http://wiki.apache.org/nutch/Crawl?action=diff&rev1=8&rev2=9

--

  === NUTCH_HOME ===
  If you are not executing the script as 'bin/runbot' from the Nutch directory, 
you should either set the environment variable 'NUTCH_HOME' or edit the 
following in the script:
  
+ {{{
- {{{if [ -z $NUTCH_HOME ]
+ if [ -z "$NUTCH_HOME" ]
  then
-   NUTCH_HOME=.}}}
+   NUTCH_HOME=.
+ }}}
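The effect of this guard can be sketched standalone: a value already present in the environment wins, and the fallback assignment only fires when the variable is empty (the preset path below is a made-up example):

```shell
# Hypothetical preset, standing in for an exported environment variable
NUTCH_HOME=/opt/nutch-demo
# Same guard as in the script: only assign the default when unset/empty
if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
fi
echo "$NUTCH_HOME"   # -> /opt/nutch-demo (the fallback "." was skipped)
```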
  
  Set 'NUTCH_HOME' to the path of the Nutch directory (if you are not setting 
it as an environment variable, since if environment variable is set, the above 
assignment is ignored).
  
  === CATALINA_HOME ===
  'CATALINA_HOME' points to the Tomcat installation directory. You must either 
set this as an environment variable or set it by editing the following lines in 
the script:
  
+ {{{
- {{{if [ -z $CATALINA_HOME ]
+ if [ -z "$CATALINA_HOME" ]
  then
-   CATALINA_HOME=/opt/apache-tomcat-6.0.10}}}
+   CATALINA_HOME=/opt/apache-tomcat-6.0.10
+ }}}
  
  Similar to the previous section, if this variable is set in the environment, 
then the above assignment is ignored.
  


[Nutch Wiki] Trivial Update of Crawl by susam

2010-03-15 Thread Apache Wiki

The Crawl page has been changed by susam.
The comment on this change is: Fixed typo.
http://wiki.apache.org/nutch/Crawl?action=diff&rev1=9&rev2=10

--

  Similar to the previous section, if this variable is set in the environment, 
then the above assignment is ignored.
  
  == Can it re-crawl? ==
- The author has used this script to re-crawl a couple of times. However, no 
real world testing has been done for re-crawling. Therefore, you may try to use 
the script of re-crawl. If it works out fine or it doesn't work properly for 
re-crawl, please let us know.
+ The author has used this script to re-crawl a couple of times. However, no 
real world testing has been done for re-crawling. Therefore, you may try to use 
the script for re-crawl. If it works fine or it doesn't work properly for 
re-crawl, please let us know.
  
  == Script ==
  {{{


[Nutch Wiki] Update of Becoming_A_Nutch_Developer by maqboolzee

2010-03-02 Thread Apache Wiki

The Becoming_A_Nutch_Developer page has been changed by maqboolzee.
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer?action=diff&rev1=7&rev2=8

--

  
   * [[http://www.mail-archive.com/index.php?hunt=nutch|Nutch Mail Archive]]
   * [[http://www.nabble.com/forum/Search.jtp?query=nutch|Nabble Nutch]]
+  * [[http://search.lucidimagination.com/search/#/p:nutch][Lucid Imagination 
Email]]
  
  When searching the list for errors you have received it is good to search 
both by component, for example fetcher, and by the actual error received.  If 
you are not finding the answers you are looking for on the list, you may want 
to move to the JIRA and search there for answers.
  


[Nutch Wiki] Update of Becoming_A_Nutch_Developer by maqboolzee

2010-03-02 Thread Apache Wiki

The Becoming_A_Nutch_Developer page has been changed by maqboolzee.
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer?action=diff&rev1=8&rev2=9

--

  
   * [[http://www.mail-archive.com/index.php?hunt=nutch|Nutch Mail Archive]]
   * [[http://www.nabble.com/forum/Search.jtp?query=nutch|Nabble Nutch]]
-  * [[http://search.lucidimagination.com/search/#/p:nutch][Lucid Imagination 
Email]]
+  * [[http://search.lucidimagination.com/search/#/p:nutch|Lucid Imagination 
Email]]
  
  When searching the list for errors you have received it is good to search 
both by component, for example fetcher, and by the actual error received.  If 
you are not finding the answers you are looking for on the list, you may want 
to move to the JIRA and search there for answers.
  


New attachment added to page Evaluations on Nutch Wiki

2010-02-26 Thread Apache Wiki
You have subscribed to a wiki page Evaluations for change notification. An 
attachment has been added to that page by IvanKelly. The following detailed 
information is available:

Attachment name: OSU_Queries.pdf
Attachment size: 77705
Attachment link: 
http://wiki.apache.org/nutch/Evaluations?action=AttachFile&do=get&target=OSU_Queries.pdf
Page link: http://wiki.apache.org/nutch/Evaluations


[Nutch Wiki] Update of Evaluations by IvanKelly

2010-02-26 Thread Apache Wiki

The Evaluations page has been changed by IvanKelly.
http://wiki.apache.org/nutch/Evaluations?action=diff&rev1=4&rev2=5

--

  -- DougCutting - 29 Jun 2004
  
  
||'''Attachment:'''||'''Action:'''||'''Size:'''||'''Date:'''||'''Who:'''||'''Comment:'''||
- 
||[[http://www.nutch.org/twiki/Main/Evaluations/OSU_Queries.pdf|OSU_Queries.pdf]]||action||77705||29
 Jun 2004 - 17:07||DougCutting||OSU evaluation by Lyle Benedict||
+ ||[[attachment:OSU_Queries.pdf|OSU_Queries.pdf]]||action||77705||29 Jun 2004 
- 17:07||DougCutting||OSU evaluation by Lyle Benedict||
  ||   || ||   || ||   ||   
 
  


[Nutch Wiki] Update of RunNutchInEclipse1.0 by maqboolzee

2010-02-20 Thread Apache Wiki

The RunNutchInEclipse1.0 page has been changed by maqboolzee.
http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diffrev1=15rev2=16

--

  
   1. Execute 'ant job' (which is the default) after downloading nutch through 
SVN
  
-  1. Update plugin.folders (under nutch-default.xml) to 
ECLIPSE_OUTPUT_FOLDER/plugins
+  1. Update plugin.folders (under nutch-default.xml) to build/plugins (where 
ant builds plugins)
  
   1. If it still fails increase your memory allocation or find a simpler 
website to crawl.
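
The plugin.folders change in the step above can be sketched as a property override; this is only a sketch, assuming the default ant output directory build/plugins relative to the project root (put it in nutch-site.xml rather than editing nutch-default.xml if you prefer to keep defaults intact):

```xml
<property>
  <name>plugin.folders</name>
  <!-- where ant puts the built plugins; adjust if your Eclipse output folder differs -->
  <value>build/plugins</value>
</property>
```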
  


[Nutch Wiki] Update of RunNutchInEclipse1.0 by maqboolzee

2010-02-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunNutchInEclipse1.0 page has been changed by maqboolzee.
The comment on this change is: plugins BasicURLNormalizer exception resolution.
http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diffrev1=13rev2=14

--

  = Run Nutch In Eclipse on Linux and Windows nutch version 1.0 =
- 
  This is a work in progress. If you find errors or would like to improve this 
page, just create an account [UserPreferences] and start editing this page :-)
  
  == Tested with ==
@@ -12, +11 @@

   * Windows XP and Vista
  
  == Before you start ==
- 
  Setting up Nutch to run in Eclipse can be tricky, and most of the time it 
is much faster to edit Nutch in Eclipse but run the scripts from the 
command line (my 2 cents). However, it's very useful to be able to debug Nutch 
in Eclipse. Sometimes examining the logs (logs/hadoop.log) is a quicker way to 
debug a problem.
  
- 
  == Steps ==
- 
- 
  === For Windows Users ===
- 
  If you are running Windows (tested on Windows XP) you must first install 
cygwin. Download it from http://www.cygwin.com/setup.exe
  
  Install cygwin and set the PATH environment variable for it. You can set it 
from the Control Panel, System, Advanced Tab, Environment Variables and 
edit/add PATH.
  
  Example PATH:
+ 
  {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
  }}}
  If you run bash from the Windows command line (Start > Run... > cmd.exe) it 
should successfully run cygwin.
  
  If you are running Eclipse on Vista, you will need to either give cygwin 
administrative privileges or 
[[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/|turn
 off Vista's User Access Control (UAC)]]. Otherwise Hadoop will likely complain 
that it cannot change a directory permission when you later run the crawler:
+ 
  {{{
  org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions 
of ... Permission denied
  }}}
- 
- See 
[[http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell%24ExitCodeException%3A%20chmod%3A%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results|this]]
 for more information about the UAC issue.
+ See 
[[http://markmail.org/message/ymgygimtvuksn2ic#query:Exception%20in%20thread%20main%20org.apache.hadoop.util.Shell$ExitCodeException:%20chmod:%20changing%20permissions+page:1+mid:pj3spjhvdtjx736q+state:results|this]]
 for more information about the UAC issue.
  
  === Install Nutch ===
-  * Grab a [[http://lucene.apache.org/nutch/version_control.html|fresh 
release]] of Nutch 1.0 or download and untar the 
[[http://lucene.apache.org/nutch/release/|official 1.0 release]]. 
+  * Grab a [[http://lucene.apache.org/nutch/version_control.html|fresh 
release]] of Nutch 1.0 or download and untar the 
[[http://lucene.apache.org/nutch/release/|official 1.0 release]].
   * Do not build Nutch yet. Make sure you have no .project and .classpath 
files in the Nutch directory
- 
  
  === Create a new Java Project in Eclipse ===
   * File  New  Project  Java project  click Next
   * Name the project (Nutch_Trunk for instance)
   * Select Create project from existing source and use the location where 
you downloaded Nutch
   * Click on Next, and wait while Eclipse is scanning the folders
-  * Add the folder conf to the classpath (Right-click on the project, select 
properties then Java Build Path tab (left menu) and then the Libraries 
tab. Click Add Class Folder... button, and select conf from the list) 
+  * Add the folder conf to the classpath (Right-click on the project, select 
properties then Java Build Path tab (left menu) and then the Libraries 
tab. Click Add Class Folder... button, and select conf from the list)
   * Go to Order and Export tab, find the entry for added conf folder and 
move it to the top (by checking it and clicking the Top button). This is 
required so Eclipse will take config (nutch-default.xml, nutch-final.xml, etc.) 
resources from our conf folder and not from somewhere else.
-  * Eclipse should have guessed all the Java files that must be added to your 
classpath. If that's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries 
+  * Eclipse should have guessed all the Java files that must be added to your 
classpath. If that's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries
   * Click the Source tab and set the default output folder to 
Nutch_Trunk/bin/tmp_build. (You may need to create the tmp_build folder.)
   * Click the Finish button
   * DO NOT add build to classpath
  
- 
  === Configure Nutch ===
   * See the 

[Nutch Wiki] Update of RunNutchInEclipse1.0 by maqboolzee

2010-02-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunNutchInEclipse1.0 page has been changed by maqboolzee.
http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diffrev1=14rev2=15

--

  === NOTE: Additional note for people who want to run Eclipse with the latest 
nutch code ===
  If you are getting the following exception - 
org.apache.nutch.plugin.PluginRuntimeException: 
java.lang.ClassNotFoundException: 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
  
- 1. Execute 'ant job' (which is the default) after downloading nutch through 
SVN
+  1. Execute 'ant job' (which is the default) after downloading nutch through 
SVN
+ 
- 2. Update plugin.folders (under nutch-default.xml) to 
ECLIPSE_OUTPUT_FOLDER/plugins
+  1. Update plugin.folders (under nutch-default.xml) to 
ECLIPSE_OUTPUT_FOLDER/plugins
+ 
- 3. If it still fails increase your memory allocation or find a simpler 
website to crawl.
+  1. If it still fails increase your memory allocation or find a simpler 
website to crawl.
  
  === Unit tests work in eclipse but fail when running ant in the command line 
===
  Suppose your unit tests work perfectly in Eclipse, but each and every one fails 
when running '''ant test''' on the command line - including the ones you 
haven't modified. Check whether you defined the '''plugin.folders''' property in 
hadoop-site.xml. If so, try removing it from that file and adding it 
directly to nutch-site.xml.
@@ -235, +237 @@

  
  Original credits: RenaudRichardet
  
+ Updated by: Zeeshan
+ 


[Nutch Wiki] Update of Support by OtisGospodnetic

2010-01-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Support page has been changed by OtisGospodnetic.
http://wiki.apache.org/nutch/Support?action=diffrev1=47rev2=48

--

* [[http://www.ingate.de|INGATE GmbH]]
* [[http://www.intrafind.de|IntraFind Software AG]]
* Michael Rosset mrosset at btmeta.com
+   * [[http://sematext.com/|Sematext]] (Otis Gospodnetic, Lucene in Action and 
Solr in Action co-author) - Solr, Lucene, Nutch, Hadoop, HBase, EC2.  
Lucene/Solr [[http://sematext.com/services/tech-support.html|tech support]], 
development and [[http://sematext.com/services/index.html|consulting 
services]], and [[http://sematext.com/products/index.html|search products]].  
Presence in North America and Europe.
* Supreet Sethi supreet at linux-delhi.org (india preferred)
* Sudhi Seshachala sudhi_...@yahoo.com Please visit 
http://www.myopensourcejobs.com (Built on LAMP and Nutch)
* http://www.termindoc.de (SP data GmbH, Germany schackenberg at 
termindoc.de)


Page search2.net deleted from Nutch Wiki

2010-01-26 Thread Apache Wiki
Dear wiki user,

You have subscribed to a wiki page Nutch Wiki for change notification.

The page search2.net has been deleted by search2.net.
The comment on this change is: empty page.
http://wiki.apache.org/nutch/search2.net


[Nutch Wiki] Update of FrontPage by JohnWhelan

2010-01-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by JohnWhelan.
The comment on this change is: Changes to Cygwin mount points have broken the 
WhelanLabs Search Engine Manager. No new version is planned..
http://wiki.apache.org/nutch/FrontPage?action=diffrev1=128rev2=129

--

   * [[http://blog.foofactory.fi/|FooFactory]] Nutch and Hadoop related posts
   * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open 
Source components]] (our contribution to the crawling OSS community with more 
to come).
   * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better 
quality Nutch logos]] Re-created Nutch logos available in GIF, PNG  EPS in 
resolutions up to 1200 x 449
-  * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs 
SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, 
and Cygwin, and JRE for Microsoft Windows. Includes an installer and a 
simplified administrative UI.
  


[Nutch Wiki] Update of RunningNutchAndSolr by GeoffBentley

2010-01-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunningNutchAndSolr page has been changed by GeoffBentley.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diffrev1=28rev2=29

--

  = New in Nutch 1.0-dev =
- Please note that in the nightly version of Apache Nutch there is now a Solr 
integration embedded, so you can get started a lot more easily. Just download a 
nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that in the nightly version of Apache Nutch there is now a Solr 
integration embedded, so you can get started a lot more easily. Just download a 
nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
  
  = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to [[http://variogram.com||Brian Whitman at 
Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for 
all the help!  You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to http://variogram.com and http://blog.foofactory.fi 
for all the help!  You guys saved me a lot of time! :)
  
  I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
  
@@ -12, +12 @@

   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
- 
  The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
  
  '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
@@ -23, +22 @@

  
  '''4.''' Extract the Nutch package: tar xzf apache-nutch-1.0.tar.gz
  
+ '''5.''' Configure Solr. For the sake of simplicity we are going to use the 
example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
  
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf 
(override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf 
to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
  
  We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
  
@@ -52, +48 @@

  
  <str name="qf">
  
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
  
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- </str>
  
+ <str name="fl"> url </str>
- <str name="fl">
- url
- </str>
  
+ <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>
- <str name="mm">
- 2&lt;-1 5&lt;-2 6&lt;90%
- </str>
  
  <int name="ps">100</int>
  
- <bool hl=true/>
+ <bool name="hl">true</bool>
  
  <str name="q.alt">*:*</str>
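
Assembled, the dismax parameters above would sit inside a request handler in solrconfig.xml. This is only a sketch: the handler name and the surrounding defaults element are assumptions based on Solr 1.3's example configuration, not taken from this page (note that the `<` characters in the mm value must be escaped as `&lt;` in XML):

```xml
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <!-- minimum-match clauses: escaped '<' characters -->
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```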
  
@@ -91, +80 @@

  
  '''6.''' Start Solr
  
+ cd apache-solr-1.3.0/example && java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar
  
  '''7. Configure Nutch'''
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its 
contents with the following (we specify our crawler name, active plugins, and 
limit the maximum URL count for a single host per run to 100):
  
+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
  
  <property>
  
@@ -109, +96 @@

  
  </property>
  
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
  
  <value>100</value>
  
@@ -126, +112 @@

  
  </configuration>
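
Pieced together, the nutch-site.xml described above might look like the following sketch. Only generate.max.per.host and its value of 100 appear on this page; the agent-name property and its value are assumptions standing in for the crawler-name setting elided between the diff hunks:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- crawler name mentioned in the prose; this value is a placeholder -->
    <name>http.agent.name</name>
    <value>my-nutch-crawler</value>
  </property>
  <property>
    <!-- limit the maximum URL count for a single host per run -->
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
</configuration>
```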
  
- 
  '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace 
its content with the following:
  
  -^(https|telnet|file|ftp|mailto):
+ 
-  
- # skip some suffixes
- 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-  
+ 
- # skip URLs 

[Nutch Wiki] Update of TikaPlugin by JulienNioche

2010-01-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The TikaPlugin page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diffrev1=3rev2=4

--

  = Tika Plugin =
  The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch. This page will list the differences in coverage or 
functionality between the Tika plugin and the existing Nutch parsers. Tika also 
has more formats not covered by Nutch which are not described here and has a 
more generic capability of representing structured content which can be useful 
for HtmlParseFilters (which are currently limited to HTML content).
  
- '''html''': ?
+ '''html''': comparable
  
  '''js''': ?
  
@@ -21, +21 @@

  
  '''rss''': ?
  
- '''rtf''': comparable
+ '''rtf''': deactivated in Nutch for licensing reasons | works in Tika
  
  '''swf''' : not yet covered in Tika (see 
https://issues.apache.org/jira/browse/TIKA-337)
  


[Nutch Wiki] Update of TikaPlugin by JulienNioche

2010-01-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The TikaPlugin page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diffrev1=4rev2=5

--

  
  '''js''': ?
  
- '''mp3''': ?
+ '''mp3''': Nutch identifies several fields (Title, Album, Artist) whereas 
Tika knows only about Titles; the rest is stored as paragraphs.
  
  '''msexcel''': comparable (+ Tika able to represent content in structured way 
as XHTML tables which can be useful for HTML parser plugins)
  
@@ -19, +19 @@

  
  '''pdf''': comparable
  
- '''rss''': ?
+ '''rss''': Tika identifies only the Mimetype but does nothing about the 
content
  
  '''rtf''': deactivated in Nutch for licensing reasons | works in Tika
  
  '''swf''' : not yet covered in Tika (see 
https://issues.apache.org/jira/browse/TIKA-337)
  
- '''text''': ?
+ '''text''': comparable
  
  '''zip''': ?
  


[Nutch Wiki] Trivial Update of PublicServers by GeoffreyMcCaleb

2010-01-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PublicServers page has been changed by GeoffreyMcCaleb.
The comment on this change is: Updated description of nsyght.com.
http://wiki.apache.org/nutch/PublicServers?action=diffrev1=72rev2=73

--

  
   * [[http://www.myopensourcejobs.com|MyOpensourcejobs]] An open-source skills 
jobs site using Nutch and a LAMP-based Drupal CMS.
  
-  * [[http://www.nsyght.com|Nsyght.com]] is a social search engine that 
customizes a users search based on their social graph.
+  * [[http://www.nsyght.com|Nsyght.com]] is a real-time search and aggregation 
service that leverages users' social graph.
  
   * [[http://www.nursewebsearch.com|Nurse Web Search]] - Health Internet 
Search Engine.
  


[Nutch Wiki] Update of FAQ by GodmarBack

2010-01-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FAQ page has been changed by GodmarBack.
The comment on this change is: Corrected formatting - the {{{ must be in the 
first column, apparently..
http://wiki.apache.org/nutch/FAQ?action=diffrev1=111rev2=112

--

  There are at least two choices to do that:
  
First you need to copy the .WAR file to the servlet container webapps 
folder.
+ {{{
-  {{{% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
+% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
  }}}
  
1) After building your first index, start Tomcat from the index folder.
  Assuming your index is located at /index :
+ {{{
- {{{% cd /index/
+ % cd /index/
- % $CATALINA_HOME/bin/startup.sh}}}
+ % $CATALINA_HOME/bin/startup.sh
+ }}}
  '''Now you can search.'''
  
   2) After building your first index, start and stop Tomcat, which will make 
Tomcat extract the Nutch webapp. Then you need to edit the nutch-site.xml and 
put in it the location of the index folder.
+ {{{
- {{{% $CATALINA_HOME/bin/startup.sh
+ % $CATALINA_HOME/bin/startup.sh
- % $CATALINA_HOME/bin/shutdown.sh}}}
+ % $CATALINA_HOME/bin/shutdown.sh
+ }}}
  
+ {{{
- {{{% vi $CATALINA_HOME/bin/webapps/ROOT/WEB-INF/classes/nutch-site.xml
+ % vi $CATALINA_HOME/bin/webapps/ROOT/WEB-INF/classes/nutch-site.xml
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
  
@@ -85, +91 @@

  
  </nutch-conf>
  
- % $CATALINA_HOME/bin/startup.sh}}}
+ % $CATALINA_HOME/bin/startup.sh
+ }}}
  
  === Injecting ===
  
@@ -110, +117 @@

  
   You'll need to create a file fetcher.done in the segment directory and 
than: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], 
[[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and 
[[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]]. (Read "and then".)
Assuming your index is at /index
+ {{{ 
-   {{{ % touch /index/segments/2005somesegment/fetcher.done
+ % touch /index/segments/2005somesegment/fetcher.done
  
  % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
  
  % bin/nutch generate /index/db/ /index/segments/2005somesegment/
  
- % bin/nutch fetch /index/segments/2005somesegment}}}
+ % bin/nutch fetch /index/segments/2005somesegment
+ }}}
  
All the pages that were not crawled will be re-generated for fetch. If 
you fetched lots of pages, and don't want to have to re-fetch them again, this 
is the best way.
  
@@ -146, +155 @@

  
  If you have a fast internet connection (> 10Mb/sec) your bottleneck will 
definitely be in the machine itself (in fact you will need multiple machines to 
saturate the data pipe).  Empirically I have found that the machine works well 
up to about 1000-1500 threads.  
  
- To get this to work on my Linux box I needed to set the ulimit to 65535 
(ulimit -n 65535), and I had to make sure that the DNS server could handle the 
load (we had to speak with our colo to get them to shut off an artifical cap on 
the DNS servers).  Also, in order to get the speed up to a reasonable value, we 
needed to set the maximum fetches per host to 100 (otherwise we get a quick 
start followed by a very long slow tail of fetching).
+ To get this to work on my Linux box I needed to set the ulimit to 65535 
(ulimit -n 65535), and I had to make sure that the DNS server could handle the 
load (we had to speak with our colo to get them to shut off an artificial cap 
on the DNS servers).  Also, in order to get the speed up to a reasonable value, 
we needed to set the maximum fetches per host to 100 (otherwise we get a quick 
start followed by a very long slow tail of fetching).
  
  To other users: please add to this with your own experiences, my own 
experience may be atypical.
  
@@ -208, +217 @@

  +.*
  
3) By default the 
[[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|file
 plugin]] is disabled. nutch-site.xml needs to be modified to allow this 
plugin. Add an entry like this:
- 
+ {{{
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
+ }}}
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs 
from a web page fetched with http, so if you test with the Nutch web container 
running in Tomcat, annoyingly, as you click on results nothing will happen as 
Mozilla by default does not load file URLs. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and 
this behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
  
   Nutch crawling parent directories for file protocol -  

[Nutch Wiki] Update of FAQ by GodmarBack

2010-01-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FAQ page has been changed by GodmarBack.
The comment on this change is: added useful link to Crawling the local 
filesystem page..
http://wiki.apache.org/nutch/FAQ?action=diffrev1=112rev2=113

--

  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs 
from a web page fetched with http, so if you test with the Nutch web container 
running in Tomcat, annoyingly, as you click on results nothing will happen as 
Mozilla by default does not load file URLs. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and 
this behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
  
-  Nutch crawling parent directories for file protocol -  misconfigured 
URLFilters 
+  Nutch crawling parent directories for file protocol 
+ 
+ If you find nutch crawling parent directories when using the file protocol, 
the following kludge may help:
+ 
- [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex 
you should put the following in regex-urlfilter.txt :
+ [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex 
you could put the following in regex-urlfilter.txt :
  {{{
  +^file:///c:/top/directory/
  -.
  }}}
+ 
+ Alternatively, you could apply the patch described 
[[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on 
this page]], which would avoid the hardwiring of the site-specific 
/top/directory in your configuration file.
  
   How do I index remote file shares? 
  


[Nutch Wiki] Update of FAQ by GodmarBack

2010-01-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FAQ page has been changed by GodmarBack.
http://wiki.apache.org/nutch/FAQ?action=diffrev1=113rev2=114

--

  {{{
  <property>
    <name>plugin.includes</name>
-   <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
+   <value>protocol-file|...copy original values from nutch-default here...</value>
  </property>
  }}}
+ 
+ where you should copy and paste all values from nutch-default.xml in the 
plugin.includes setting provided there. This will ensure that all plug-in 
normally enabled will be enabled, plus the protocol-file plugin. Make sure to 
include parse-pdf if you want to parse PDF files. Make sure that 
urlfilter-regexp is included, or else '''the *urlfilter files will be 
ignored''', leading nutch to accept all URLs. You need to enable crawl URL 
filters to prevent nutch from crawling up the parent directory, see below.
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs 
from a web page fetched with http, so if you test with the Nutch web container 
running in Tomcat, annoyingly, as you click on results nothing will happen as 
Mozilla by default does not load file URLs. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and 
this behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
  


[Nutch Wiki] Update of FAQ by GodmarBack

2010-01-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FAQ page has been changed by GodmarBack.
The comment on this change is: Fixed erroneous instructions for how to include 
protocol-file.
http://wiki.apache.org/nutch/FAQ?action=diffrev1=114rev2=115

--

  </property>
  }}}
  
- where you should copy and paste all values from nutch-default.xml in the 
plugin.includes setting provided there. This will ensure that all plug-in 
normally enabled will be enabled, plus the protocol-file plugin. Make sure to 
include parse-pdf if you want to parse PDF files. Make sure that 
urlfilter-regexp is included, or else '''the *urlfilter files will be 
ignored''', leading nutch to accept all URLs. You need to enable crawl URL 
filters to prevent nutch from crawling up the parent directory, see below.
+ where you should copy and paste all values from nutch-default.xml in the 
plugin.includes setting provided there. This will ensure that all plug-ins 
normally enabled will be enabled, plus the protocol-file plugin. Make sure to 
include parse-pdf if you want to parse PDF files. Make sure that 
urlfilter-regexp is included, or else '''the *urlfilter files will be 
ignored''', leading nutch to accept all URLs. You need to enable crawl URL 
filters to prevent nutch from crawling up the parent directory, see below.
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs 
from a web page fetched with http, so if you test with the Nutch web container 
running in Tomcat, annoyingly, as you click on results nothing will happen as 
Mozilla by default does not load file URLs. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and 
this behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
  


[Nutch Wiki] Update of search2.net by search2.net

2009-12-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The search2.net page has been changed by search2.net.
http://wiki.apache.org/nutch/search2.net

--

New page:
##language:en
== search2.net ==

 * [[http://search2.net/|search2.net]]




[Nutch Wiki] Update of PublicServers by search2.net

2009-12-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PublicServers page has been changed by search2.net.
http://wiki.apache.org/nutch/PublicServers?action=diffrev1=71rev2=72

--

  
   * [[http://www.gouv.qc.ca/|Government of Quebec websites]] Over 400 websites 
of the government of Quebec (Canada) are indexed by Nutch. The Web application 
has been developed by [[http://www.doculibre.com/index_en.html/|Doculibre 
inc.]]
  
-  * [[http://search2.net/|search2.net]] is a search engine based on Nutch.
+  * [[http://search2.net/|search2.net]] General search engine with an 
international index based on Nutch.
   * [[http://www.searchmitchell.com/|SearchMitchell.com]] is a community 
search engine for businesses and organizations in Mitchell, SD.
  
   * [[http://www.umkreisfinder.de/|UmkreisFinder.de]] is running the 
GeoPosition plugin for local searches in Germany and in German. Please insert a 
search term in the first field, a German city name in the second field and 
choose a perimeter at the last field.


[Nutch Wiki] Update of PublicServers by RBalmes

2009-12-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PublicServers page has been changed by RBalmes.
http://wiki.apache.org/nutch/PublicServers?action=diffrev1=71rev2=72

--

  = Public search engines using Nutch =
+ 
  Please sort by name alphabetically
  
-  * [[http://askaboutoil.com|AskAboutOil]] is a vertical search portal for the 
petroleum industry.
+   * [[http://askaboutoil.com|AskAboutOil]] is a vertical search portal for 
the petroleum industry.
  
-  * [[http://www.asbestosinfo.info|Asbestos]] is a vertical search portal and 
discussion forum for the asbestos and related information.
+   * [[http://www.asbestosinfo.info|Asbestos]] is a vertical search portal and 
discussion forum for the asbestos and related information.
  
-  * [[http://www.baynote.com/go|Baynote]] provides free hosted Nutch search 
for businesses.
+   * [[http://www.baynote.com/go|Baynote]] provides free hosted Nutch search 
for businesses.
  
-  * [[http://betherebesquare.com|BeThere BeSquare]] is an Event Search Engine 
for the San Francisco Bay Area that allows users to specify keywords, date, 
city, address, and category and get details about events in 4 different views.
+   * [[http://betherebesquare.com|BeThere BeSquare]] is an Event Search Engine 
for the San Francisco Bay Area that allows users to specify keywords, date, 
city, address, and category and get details about events in 4 different views.
  
-  * [[http://www.bible-ref.om/|Biible]] is the first biblical search engine 
that allows people to search the web for comments of biblical verse or range of 
verse. 6 major languages are fully recognized and 150 partially for now. Based 
on Nutch.
+   * [[http://www.bigsearch.ca/|Bigsearch.ca]] uses nutch open source software 
to deliver its search results.
  
-  * [[http://www.bigsearch.ca/|Bigsearch.ca]] uses nutch open source software 
to deliver its search results.
+   * [[http://busytonight.com/|BusyTonight]]: Search for any event in the 
United States, by keyword, location, and date. Event listings are automatically 
crawled and updated from original source Web sites.
  
-  * [[http://busytonight.com/|BusyTonight]]: Search for any event in the 
United States, by keyword, location, and date. Event listings are automatically 
crawled and updated from original source Web sites.
+   * [[http://www.centralbudapest.com/search|Central Budapest Search]] is a 
search engine for English language sites focussing on Budapest news, 
restaurants, accommodation, life and events.
  
-  * [[http://www.centralbudapest.com/search|Central Budapest Search]] is a 
search engine for English language sites focussing on Budapest news, 
restaurants, accommodation, life and events.
+   * [[http://circuitscout.com|Circuit Scout]] is a search engine for 
electrical circuits.
  
-  * [[http://circuitscout.com|Circuit Scout]] is a search engine for 
electrical circuits.
+   * [[http://www.comtecsearch.com|Comtec Search]] is a search engine for UK 
Tour Operator Package Holiday Brochures.
  
-  * [[http://www.comtecsearch.com|Comtec Search]] is a search engine for UK 
Tour Operator Package Holiday Brochures.
+   * [[http://www.coder-suche.de|Coder-Suche.de]] searches for coding 
resources such as APIs, documentation, tutorials, openBooks and more. Its 
origin is German; its contents are mainly English.
  
-  * [[http://www.coder-suche.de|Coder-Suche.de]] searches for coding 
resources such as APIs, documentation, tutorials, openBooks and more. Its 
origin is German; its contents are mainly English.
+   * [[http://campusgw.library.cornell.edu/|Cornell University Library]] is 
collaborating with the research group of Thorsten Joachims to develop a 
learning search engine for library web pages based on Nutch. The nutch-based 
search engine is near the bottom of the page.
  
-  * [[http://campusgw.library.cornell.edu/|Cornell University Library]] is 
collaborating with the research group of Thorsten Joachims to develop a 
learning search engine for library web pages based on Nutch. The nutch-based 
search engine is near the bottom of the page.
+   * [[http://search.creativecommons.org/|Creative Commons]] is a search 
engine for creative commons licensed material.
  
-  * [[http://search.creativecommons.org/|Creative Commons]] is a search engine 
for creative commons licensed material.
+   * [[http://www.dadi360.com/|Dadi360]] uses the Nutch search engine to 
provide search of Chinese-language websites in North America.
  
-  * [[http://www.dadi360.com/|Dadi360]] uses the Nutch search engine to 
provide search of Chinese-language websites in North America.
+   * [[http://www.ecolicommunity.org/Websearch|EcoliHub Web Search]] is an E. 
coli-specific search engine based on Nutch. EcoliHub WebSearch includes only 
those sites relevant to E. coli, thereby reducing the number of spurious hits. 
Searches can optionally be limited to your choice of resources. More 
[Nutch Wiki] Update of PublicServers by search2.net

2009-12-24 Thread Apache Wiki

The PublicServers page has been changed by search2.net.
http://wiki.apache.org/nutch/PublicServers?action=diff&rev1=70&rev2=71

--

  
* [[http://www.gouv.qc.ca/|Government of Quebec websites]] Over 400 
websites of the government of Quebec (Canada) are indexed by Nutch. The Web 
application has been developed by 
[[http://www.doculibre.com/index_en.html/|Doculibre inc.]]
  
-   * [[http://search2.net/|search2.net]] is a search engine based on Nutch.
+   * [[http://search2.net/|search2.net]] General search engine with an 
international index.
* [[http://www.searchmitchell.com/|SearchMitchell.com]] is a community 
search engine for businesses and organizations in Mitchell, SD.
  
* [[http://www.umkreisfinder.de/|UmkreisFinder.de]] is running the 
[[GeoPosition]] plugin for local searches in Germany and in German. Please 
insert a search term in the first field, a German city name in the second field 
and choose a perimeter at the last field.


[Nutch Wiki] Update of PublicServers by search2.net

2009-12-24 Thread Apache Wiki

The PublicServers page has been changed by search2.net.
http://wiki.apache.org/nutch/PublicServers?action=diff&rev1=71&rev2=72

--

  
* [[http://www.gouv.qc.ca/|Government of Quebec websites]] Over 400 
websites of the government of Quebec (Canada) are indexed by Nutch. The Web 
application has been developed by 
[[http://www.doculibre.com/index_en.html/|Doculibre inc.]]
  
-   * [[http://search2.net/|search2.net]] General search engine with an 
international index.
+   * [[http://search2.net/|search2.net]] is a general search engine with an 
international index.
* [[http://www.searchmitchell.com/|SearchMitchell.com]] is a community 
search engine for businesses and organizations in Mitchell, SD.
  
* [[http://www.umkreisfinder.de/|UmkreisFinder.de]] is running the 
[[GeoPosition]] plugin for local searches in Germany and in German. Please 
insert a search term in the first field, a German city name in the second field 
and choose a perimeter at the last field.


[Nutch Wiki] Update of TikaPlugin by JulienNioche

2009-12-16 Thread Apache Wiki

The TikaPlugin page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=2&rev2=3

--

  = Tika Plugin =
- The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch. This page will list the differences in coverage or 
functionality between the Tika plugin and the existing Nutch parsers. Tika also 
has more formats not covered by Nutch which are not described here.
+ The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch. This page will list the differences in coverage or 
functionality between the Tika plugin and the existing Nutch parsers. Tika also 
has more formats not covered by Nutch which are not described here and has a 
more generic capability of representing structured content which can be useful 
for HtmlParseFilters (which are currently limited to HTML content).
  
  '''html''': ?
  
@@ -9, +9 @@

  
  '''mp3''': ?
  
- '''msexcel''': ?
+ '''msexcel''': comparable (+ Tika is able to represent content in a 
structured way as XHTML tables, which can be useful for HTML parser plugins)
  
- '''mspowerpoint''': ?
+ '''mspowerpoint''': comparable
  
- '''msword''': ?
+ '''msword''': Tika does not support Word 95; other versions are comparable
  
- '''openoffice''': ?
+ '''openoffice''': comparable
  
- '''pdf''': ?
+ '''pdf''': comparable
  
  '''rss''': ?
  
- '''rtf''': ?
+ '''rtf''': comparable
  
  '''swf''' : not yet covered in Tika (see 
https://issues.apache.org/jira/browse/TIKA-337)
  
  '''text''': ?
  
- '''zip''': ?not covered in Tika
+ '''zip''': ?
  


[Nutch Wiki] Update of TikaPlugin by JulienNioche

2009-12-15 Thread Apache Wiki

The TikaPlugin page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=1&rev2=2

--

- =Tika Plugin=
+ = Tika Plugin =
+ The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch. This page will list the differences in coverage or 
functionality between the Tika plugin and the existing Nutch parsers. Tika also 
has more formats not covered by Nutch which are not described here.
  
+ '''html''': ?
- The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first 
attempt at delegating the parsing to Tika instead of having to maintain the 
parser plugins in Nutch.
- This page will list the differences in coverage or functionality between the 
Tika plugin and the existing Nutch parsers.
  
+ '''js''': ?
+ 
+ '''mp3''': ?
+ 
+ '''msexcel''': ?
+ 
+ '''mspowerpoint''': ?
+ 
+ '''msword''': ?
+ 
+ '''openoffice''': ?
+ 
+ '''pdf''': ?
+ 
+ '''rss''': ?
+ 
+ '''rtf''': ?
+ 
+ '''swf''' : not yet covered in Tika (see 
https://issues.apache.org/jira/browse/TIKA-337)
+ 
+ '''text''': ?
+ 
+ '''zip''': ?not covered in Tika
+ 


[Nutch Wiki] Update of FrontPage by JulienNioche

2009-12-14 Thread Apache Wiki

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=127&rev2=128

--

   * JavaDemoApplication - A simple demonstration of how to use the Nutch API 
in a Java application
   * InstallingWeb2
   * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
+  * TikaPlugin - Comments on the Tika integration and differences with 
existing parse plugins
  
  == Nutch 2.0 ==
   * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.


[Nutch Wiki] Trivial Update of Automating_Fetches_with_Python by newacct

2009-11-28 Thread Apache Wiki

The Automating_Fetches_with_Python page has been changed by newacct.
http://wiki.apache.org/nutch/Automating_Fetches_with_Python?action=diff&rev1=5&rev2=6

--

  import sys
  import getopt
  import re
- import string
  import logging
  import logging.config
  import commands
@@ -259, +258 @@

total_urls += 1
  urllinecount.close()
  numsplits = total_urls / splitsize
- padding = "0" * len(`numsplits`)
+ padding = "0" * len(repr(numsplits))
  
  # create the url load folder
- linenum = 0
  filenum = 0
- strfilenum = `filenum`
+ strfilenum = repr(filenum)
  urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum
  os.mkdir(urloutdir)
  urlfile = urloutdir + "/urls"
@@ -275, +273 @@

  outhandle = open(urlfile, "w")
  
  # loop through the file
- for line in inhandle:
+ for linenum, line in enumerate(inhandle):
  
# if we have come to a split then close the current file, create a new
# url folder and open a new url file
-   if linenum > 0 and (linenum % splitsize == 0):
+   if linenum > 0 and linenum % splitsize == 0:
  
- filenum = filenum + 1
+ filenum += 1
- strfilenum = `filenum`
+ strfilenum = repr(filenum)
  urloutdir = outdir + "/urls-" + padding[len(strfilenum):] + strfilenum
  os.mkdir(urloutdir)
  urlfile = urloutdir + "/urls"
@@ -290, +288 @@

  outhandle.close()
  outhandle = open(urlfile, "w")
  
-   # write the url to the file and increase the number of lines read
+   # write the url to the file
outhandle.write(line)
-   linenum = linenum + 1
  
  # close the input and output files
  inhandle.close()
@@ -362, +359 @@

  
  # fetch the current segment
  outar = result[1].splitlines()
- output = outar[len(outar) - 1]
+ output = outar[-1]
- tempseg = string.split(output)[0]
+ tempseg = output.split()[0]
  tempseglist.append(tempseg)
  fetch = self.nutchdir + "/bin/nutch fetch " + tempseg
  self.log.info("Starting fetch for: " + tempseg)
@@ -392, +389 @@

  
# merge the crawldbs
self.log.info("Merging master and temp crawldbs.")
-   crawlmerge = (self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " +
- mastercrawldbdir + " " + string.join(tempdblist, " "))
+   crawlmerge = self.nutchdir + "/bin/nutch mergedb mergetemp/crawldb " + \
+ mastercrawldbdir + " " + " ".join(tempdblist)
self.log.info("Running: " + crawlmerge)
result = commands.getstatusoutput(crawlmerge)
self.checkStatus(result, "Error occurred while running command " + crawlmerge)
@@ -404, +401 @@

result = commands.getstatusoutput(getsegment)
self.checkStatus(result, "Error occurred while running command " + getsegment)
outar = result[1].splitlines()
-   output = outar[len(outar) - 1]
+   output = outar[-1]
-   masterseg = string.split(output)[0]
+   masterseg = output.split()[0]
-   mergesegs = (self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " +
- masterseg + " " + string.join(tempseglist, " "))
+   mergesegs = self.nutchdir + "/bin/nutch mergesegs mergetemp/segments " + \
+ masterseg + " " + " ".join(tempseglist)
self.log.info("Running: " + mergesegs)
result = commands.getstatusoutput(mergesegs)
self.checkStatus(result, "Error occurred while running command " + mergesegs)
@@ -464, +461 @@

usage.append("[-b | --backupdir] The master backup directory, [crawl-backup].\n")
usage.append("[-s | --splitsize] The number of urls per load [50].\n")
usage.append("[-f | --fetchmerge] The number of fetches to run before merging [1].\n")
-   message = string.join(usage)
+   message = " ".join(usage)
print message
  
  


[Nutch Wiki] Update of FrontPage by Davinder

2009-11-27 Thread Apache Wiki

The FrontPage page has been changed by Davinder.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=123&rev2=124

--

   * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better 
quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in 
resolutions up to 1200 x 449
   * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs 
SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, 
and Cygwin, and JRE for Microsoft Windows. Includes an installer and a 
simplified administrative UI.
  
+ * [[http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine
+ author: Doug Cutting, Yahoo! Research ]]Experiences with the Nutch search 
engine
+ author: Doug Cutting, Yahoo! Research .
+ 


[Nutch Wiki] Update of FrontPage by Davinder

2009-11-27 Thread Apache Wiki

The FrontPage page has been changed by Davinder.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=124&rev2=125

--

  Please contribute your knowledge about Nutch here!
  
  == General Information ==
-  * [[http://www.nutch.org|Nutch Website ]]
+  * [[http://www.nutch.org|Nutch Website]]
   * [[Features]]
   * PublicServers running Nutch
   * [[Presentations]] on Nutch
@@ -51, +51 @@

   * [[RunNutchInEclipse1.0]] for v1.0 (Linux and Windows)
   * [[Crawl]] - script to crawl (and possibly recrawl too)
   * IntranetRecrawl - script to recrawl a crawl
-  * MergeCrawl - script to merge 2 (or more) crawls 
+  * MergeCrawl - script to merge 2 (or more) crawls
   * SearchOverMultipleIndexes - configuring nutch to enable searching over 
multiple indexes
   * CrossPlatformNutchScripts
   * MonitoringNutchCrawls - techniques for keeping an eye on a nutch crawl's 
progress.
@@ -78, +78 @@

   * [[Website_Update_HOWTO]]
   * [[Image_Search_Design]]
   * [[NutchOSGi]]
-  * [[StrategicGoals]]
+  * StrategicGoals
-  * [[IndexStructure]]
+  * IndexStructure
   * [[Getting_Started]]
   * JavaDemoApplication - A simple demonstration of how to use the Nutch API 
in a Java application
   * InstallingWeb2
   * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
  
  == Nutch 2.0 ==
-  * [[Nutch2Architecture]] -- Discussions on the Nutch 2.0 architecture.
+  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
-  * [[NewScoring]] -- New stable pagerank like webgraph and link-analysis jobs.
+  * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
-  * [[NewScoringIndexingExample]] -- Two full fetch cycles of commands using 
new scoring and indexing systems.
+  * NewScoringIndexingExample -- Two full fetch cycles of commands using new 
scoring and indexing systems.
  
  == Other Resources ==
   * [[http://nutch.sourceforge.net/blog/cutting.html|Doug's Weblog]] -- He's 
the one who originally wrote Lucene and Nutch.
@@ -96, +96 @@

   * [[http://frutch.free.fr/wikini/|Frutch Wiki]] -- French Nutch Wiki
   * The [[http://nutch.sourceforge.net/cgi-bin/twiki/view/Main/Nutch|Old Wiki]]
   * [[Search_Theory]] Search Theory & White Papers
-  * 
[[http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E|Tutorial
 Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06]]
+  * 
[[http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20Navoni%20Roberto|Tutorial
 Hadoop+Nutch 0.8 night build Roberto Navoni 24-07-06]]
   * [[http://blog.foofactory.fi/|FooFactory]] Nutch and Hadoop related posts
   * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open 
Source components]] (our contribution to the crawling OSS community with more 
to come).
   * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better 
quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in 
resolutions up to 1200 x 449
   * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs 
SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, 
and Cygwin, and JRE for Microsoft Windows. Includes an installer and a 
simplified administrative UI.
+  *  http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine author: Doug Cutting,Video  Lecture
  
- * [[http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine
- author: Doug Cutting, Yahoo! Research ]]Experiences with the Nutch search 
engine
- author: Doug Cutting, Yahoo! Research .
- 


[Nutch Wiki] Update of FrontPage by Davinder

2009-11-27 Thread Apache Wiki

The FrontPage page has been changed by Davinder.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=125&rev2=126

--

   * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open 
Source components]] (our contribution to the crawling OSS community with more 
to come).
   * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better 
quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in 
resolutions up to 1200 x 449
   * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs 
SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, 
and Cygwin, and JRE for Microsoft Windows. Includes an installer and a 
simplified administrative UI.
-  *  http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine author: Doug Cutting,Video  Lecture
+  * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine author: Doug Cutting,Video Lecture
  


[Nutch Wiki] Update of FrontPage by Davinder

2009-11-27 Thread Apache Wiki

The FrontPage page has been changed by Davinder.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=126&rev2=127

--

   * Commercial [[Support]] and developers for hire
   * [[Mailing]] Lists
   * AcademicArticles that deal with Nutch
+  * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine author:Doug   Cutting,Video Lecture
+ 
  
  == Nutch Administration ==
   * DownloadingNutch
@@ -101, +103 @@

   * [[http://spinn3r.com|Spinn3r]] [[http://spinn3r.com/opensource.php|Open 
Source components]] (our contribution to the crawling OSS community with more 
to come).
   * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better 
quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in 
resolutions up to 1200 x 449
   * [[http://www.whelanlabs.com/content/SearchEngineManager.htm|WhelanLabs 
SearchEngine Manager]] An all-in-one, bundled implementation of Nutch, Tomcat, 
and Cygwin, and JRE for Microsoft Windows. Includes an installer and a 
simplified administrative UI.
-  * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine author: Doug Cutting,Video Lecture
  


[Nutch Wiki] Update of OptimizingCrawls by DennisKubes

2009-11-24 Thread Apache Wiki

The OptimizingCrawls page has been changed by DennisKubes.
The comment on this change is: Page about optimizing crawling speed.
http://wiki.apache.org/nutch/OptimizingCrawls

--

New page:
'''Here are the things that could potentially slow down fetching'''

1) DNS setup

2) The number of crawlers you have (too many or too few).

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) Uneven distribution of urls to fetch and politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution 
of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFs, very large html pages; again possibly an 
uneven distribution).

9) Others

'''Now how do we fix them'''

1) Have a DNS cache set up on each local crawling machine.  With multiple 
crawling machines and a single centralized DNS, crawling can act like a DoS 
attack on the DNS server, slowing the entire system.  We always did a 
two-layer setup, hitting first the local DNS cache and then a large DNS cache 
such as OpenDNS or Verizon.

2) This would be number of map tasks * fetcher.threads.fetch.  So 10 map tasks 
* 20 threads = 200 fetchers at once.  Too many and you overload your system; 
too few and the machine sits idle.  You will need to play around with this 
setting for your setup.

3) Bandwidth limitations.  Use ntop, ganglia, and other monitoring tools to 
determine how much bandwidth you are using.  Account for both in and out 
bandwidth.  A simple test: from a server inside the fetching network but not 
itself fetching, if connecting to or downloading content is very slow while 
fetching is occurring, it is a good bet you are maxing out bandwidth.  If you 
set the http timeout as described later and are maxing your bandwidth, you 
will start seeing many http timeout errors.

4) Politeness along with an uneven distribution of urls is probably the 
biggest limiting factor.  If one thread is processing a single site and there 
are a lot of urls from that site to fetch, all other threads will sit idle 
while that one thread finishes.  Some solutions: use fetcher.server.delay to 
shorten the time between page fetches, and use fetcher.threads.per.host to 
increase the number of threads fetching for a single site (this would still be 
in the same map task, though, and hence the same JVM ChildTask process).  If 
increasing this > 0 you could also set fetcher.server.min.delay to some value 
> 0 for politeness, to min and max bound the process.

5) Fetching a lot of pages from a single site, or a lot of pages from a few 
sites, will slow down fetching dramatically.  For full web crawls you want an 
even distribution so all fetching threads can be active.  Setting 
generate.max.per.host to a value > 0 will limit the number of pages from a 
single host/domain to fetch.

6) Crawl-delay can be set in robots.txt and is obeyed by Nutch.  Most sites 
don't use this setting, but a few do (some maliciously).  I have seen 
crawl-delays as high as 2 days expressed in seconds.  The 
fetcher.max.crawl.delay property will ignore pages whose crawl-delay is 
greater than x.  I usually set this to 10 seconds; the default is 30.  Even at 
10 seconds, if you have a lot of pages from a site from which you can only 
crawl 1 page every 10 seconds, it is going to be slow.  On the flip side, 
setting this to a low value will skip those pages and not fetch them.

7) Sometimes websites are just slow.  Setting a low value for http.timeout 
helps.  The default is 10 seconds.  If you don't care and want as many pages 
as fast as possible, set it lower.  Some websites, digg for instance, will 
bandwidth-limit you on their side, only allowing x connections per given time 
frame.  So even if you only have, say, 50 pages from a single site (which I 
still think is too many), it may be waiting 10 seconds on each page.  The 
ftp.timeout can also be set if fetching ftp content.

8) Lots of content means slower fetching.  If downloading PDFs and other 
non-html documents this is especially true.  To avoid non-html content you can 
use the url filters.  I prefer the prefix and suffix filters.  The 
http.content.limit and ftp.content.limit can be used to limit the amount of 
content downloaded for a single document.
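
As a sketch, items 5 through 8 map onto nutch-site.xml properties like the 
following; the property names are standard Nutch configuration keys, but the 
values shown are illustrative assumptions, not recommendations:

{{{
<!-- Hypothetical nutch-site.xml fragment: adjust to your bandwidth and goals. -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <description>Maximum urls per host in a single fetchlist
  (illustrative value).</description>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>10</value>
  <description>Skip pages whose robots.txt Crawl-Delay exceeds this many
  seconds (illustrative value).</description>
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>HTTP timeout in milliseconds (illustrative value).</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>Maximum bytes downloaded per document
  (illustrative value).</description>
</property>
}}}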

9) Other things that could be causing slow fetching:

 * Maxing out the number of open sockets/files on a machine.  You will start 
seeing IO errors or can't-open-socket errors.
 * Poor routing.  Bad routers or home routers might not be able to handle the 
number of connections going through at once.  An incorrect routing setup could 
also be causing problems but those are usually much more complex to diagnose.  
Use network trace and mapping tools if you think this is happening.  Upstream 
routing can also be a problem from your network provider.
 * Bad network cards.  I have seen network 

[Nutch Wiki] Update of FrontPage by DennisKubes

2009-11-24 Thread Apache Wiki

The FrontPage page has been changed by DennisKubes.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=122&rev2=123

--

   * NonDefaultIntranetCrawlingOptions - Desirable options to add to your 
intranet crawling configuration.
   * RunningNutchAndSolr - How to configure Nutch to crawl, but post to Solr 
for search/index
   * NutchWithChineseAnalyzer - References to some Chinese articles explaining 
how to setup Nutch with 3rd party Chinese analyzers
+  * OptimizingCrawls - How to optimize your crawling/fetching speed with Nutch.
  
  == Nutch Development ==
   * [[Becoming_A_Nutch_Developer|Becoming a Nutch Developer]] - Start 
developing and contributing to Nutch.


[Nutch Wiki] Update of NutchHadoopTutorial by ilgiz

2009-11-18 Thread Apache Wiki

The NutchHadoopTutorial page has been changed by ilgiz.
The comment on this change is: max outlinks per page.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=14&rev2=15

--


* This tutorial worked well for me; however, I ran into a problem where my 
crawl wasn't working.  It turned out to be because I needed to set the user 
agent and other properties for the crawl.  If anyone is reading this and 
running into the same problem, look at the updated tutorial 
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
  
+ 
+ 
+ * By default Nutch will read only the first 100 links on a page.  This will 
result in incomplete indexes when scanning file trees.  So I set the max 
outlinks per page option to -1 in nutch-site.xml and got complete indexes.
+ {{{
+ <property>
+   <name>db.max.outlinks.per.page</name>
+   <value>-1</value>
+   <description>The maximum number of outlinks that we'll process for a page.
+   If this value is nonnegative (>= 0), at most db.max.outlinks.per.page
+   outlinks will be processed for a page; otherwise, all outlinks will be
+   processed.</description>
+ </property>
+ }}}
+ 


[Nutch Wiki] Trivial Update of NutchHadoopTutorial by ilgiz

2009-11-18 Thread Apache Wiki

The NutchHadoopTutorial page has been changed by ilgiz.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=15&rev2=16

--

  
  
  
- * By default Nutch will read only the first 100 links on a page.  This will 
result in incomplete indexes when scanning file trees.  So I set the max 
outlinks per page option to -1 in nutch-site.xml and got complete indexes.
+   * By default Nutch will read only the first 100 links on a page.  This will 
result in incomplete indexes when scanning file trees.  So I set the max 
outlinks per page option to -1 in nutch-site.xml and got complete indexes.
  {{{
  <property>
    <name>db.max.outlinks.per.page</name>


[Nutch Wiki] Update of RunNutchInEclipse1.0 by Anas Elghafari

2009-11-14 Thread Apache Wiki

The RunNutchInEclipse1.0 page has been changed by AnasElghafari.
http://wiki.apache.org/nutch/RunNutchInEclipse1.0?action=diff&rev1=12&rev2=13

--

   * Name the project (Nutch_Trunk for instance)
   * Select "Create project from existing source" and use the location where 
you downloaded Nutch
   * Click on "Next", and wait while Eclipse is scanning the folders
-  * Add the folder "conf" to the classpath (click the "Libraries" tab, click 
the "Add Class Folder..." button, and select "conf" from the list) 
+  * Add the folder "conf" to the classpath (right-click on the project, 
select "Properties", then the "Java Build Path" tab (left menu) and then the 
"Libraries" tab; click the "Add Class Folder..." button and select "conf" from 
the list) 
   * Go to the "Order and Export" tab, find the entry for the added "conf" 
folder and move it to the top (by checking it and clicking the "Top" button). 
This is required so Eclipse will take config resources (nutch-default.xml, 
nutch-site.xml, etc.) from our "conf" folder and not from somewhere else.
   * Eclipse should have guessed all the Java files that must be added to your 
classpath. If that's not the case, add "src/java", "src/test" and all plugin 
"src/java" and "src/test" folders to your source folders. Also add all jars in 
"lib" and in the plugin "lib" folders to your libraries 
   * Click the "Source" tab and set the default output folder to 
"Nutch_Trunk/bin/tmp_build". (You may need to create the "tmp_build" folder.)
@@ -72, +72 @@

  http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
  
Copy the jar files into "src/plugin/parse-mp3/lib" and 
"src/plugin/parse-rtf/lib/" respectively.
- Then add the jar files to the build path (first refresh the workspace by 
pressing F5; then right-click the project folder > "Build Path" > "Configure 
Build Path...", select the "Libraries" tab, click "Add Jars..." and add each 
.jar file individually).
+ Then add the jar files to the build path (first refresh the workspace by 
pressing F5; then right-click the project folder > "Build Path" > "Configure 
Build Path...", select the "Libraries" tab, click "Add Jars..." and add each 
.jar file individually. If that does not work, you may try clicking "Add 
External JARs" and then point to the two directories above).
  
  === Two Errors with RTFParseFactory ===
  


[Nutch Wiki] Update of GettingNutchRunningWithJboss by TerrenceCurran

2009-11-09 Thread Apache Wiki

The GettingNutchRunningWithJboss page has been changed by TerrenceCurran.
http://wiki.apache.org/nutch/GettingNutchRunningWithJboss

--

New page:
= Running Nutch with JBoss AS 5.1 =

I only had to make minor changes beyond the basic Tomcat tutorials to get 
Nutch running on JBoss AS 5.1.

== Deployment ==
Make sure that your nutch-site.xml file is configured in your packaged .war 
file, or exploded .war directory.

== Xerces ==
JBoss ships with a different version of Xerces installed and available to all 
deployed applications.  I was getting an error about the conflict.  Removing 
xerces-2_x_x-apis.jar and xerces-2_x_x.jar from the war file's lib directory 
fixed the problem.

== Code changes ==
In the file:
/src/java/org/apache/nutch/plugin/PluginManifestParser.java

Nutch has a check to make sure it can find the plugin folder.  The check 
looks like this:
{{{
  } else if (!"file".equals(url.getProtocol())) {
    LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
    return null;
  }
}}}

This does not work in JBoss because local files in the deployment directory 
have a protocol of vfsfile://. Since vfsfile acts just like file, you just have 
to change this code to:
{{{
  } else if (!"file".equals(url.getProtocol())
      && !"vfsfile".equals(url.getProtocol())) {
    LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
    return null;
  }
}}}
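As a sanity check, the patched guard can be exercised outside Nutch. A minimal sketch (the class and method names here are hypothetical; the real check lives in PluginManifestParser):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {
    // Mirrors the patched guard: accept plain file URLs as well as
    // JBoss VFS URLs, whose protocol string is "vfsfile".
    static boolean isLoadableFrom(String protocol) {
        return "file".equals(protocol) || "vfsfile".equals(protocol);
    }

    public static void main(String[] args) throws MalformedURLException {
        // "vfsfile" is compared as a plain string, since java.net.URL
        // rejects protocols it has no registered handler for.
        System.out.println(isLoadableFrom(new URL("file:///opt/nutch/plugins").getProtocol())); // true
        System.out.println(isLoadableFrom("vfsfile")); // true
        System.out.println(isLoadableFrom("http"));    // false
    }
}
```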


[Nutch Wiki] Update of FrontPage by TerrenceCurran

2009-11-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by TerrenceCurran.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=121&rev2=122

--

   * GettingNutchRunningWithUtf8 - For support of non-ASCII characters 
(Chinese, German, Japanese, Korean).
   * GettingNutchRunningWithResin - Resin is a JSP/Servlet/EJB application 
server (alternative to tomcat).
   * GettingNutchRunningWithJetty
+  * GettingNutchRunningWithJboss
   * GettingNutchRunningWithUbuntu
   * GettingNutchRunningWithWindows
   * GettingNutchRunningWithMacOsx


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-11-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6

--

- We were planning to have a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
+ We had a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
  
- Unfortunately the only time slot where people would be around was Thursday 
night, which wound up conflicting with the Hadoop !MeetUp.
+ It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 
11am - 1pm. 
  
- So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th 
from 11am - 1pm. Location is TBD, hopefully we can get some space at the event 
but might be a lunch meeting :)
+ == Attendees ==
+ 
+  * Andrzej Bialeki - Apache Nutch
+  * Thorsten xxx - Apache Droids
+  * Michael Stack - Formerly with Heritrix, now HBase
+  * Ken Krugler - Bixo
+ 
+ == Topics ==
+ 
+ === Roadmaps ===
+ 
+ Nutch - become more component based.
+ Droids - get more people involved.
+ 
+ === Sharable Components ===
+ 
+  * robots.txt parsing
+  * URL normalization
+  * URL filtering
+  * Page cleansing
+   * General purpose
+   * Specialized
+  * Sub-page parsing (portlets)
+  * AJAX-ish page interactions
+  * Document parsing (via Tika)
+  * HttpClient (configuration)
+  * Text similarity
+  * Mime/charset/language detection
+ 
+ === Tika ===
+ 
+  * Needs help to become really usable
+  * Would benefit from large test corpus
+  * Could do comparison with Nutch parser
+  * Needs option for direct DOM querying (screen scraping tasks)
+  * Handles mime & charset detection now (some issues)
+  * Could be extended to include language detection (wrap other impl)
+ 
+ === URL Normalization ===
+ 
+  * Includes both domain (www.x.com == x.com), path, and query portions of URL
+  * Often site-specific rules
+   * Option to derive rules using URLs to similar documents.
+ 
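The kinds of rewrites listed above (domain, path, and query normalization) can be sketched as a toy normalizer. The rules below are illustrative assumptions only, not Nutch's actual urlnormalizer plugins:

```java
import java.util.Locale;

public class SimpleUrlNormalizer {
    // Illustrative rules: lower-case the URL, strip a leading "www.",
    // drop the default HTTP port, and remove a trailing slash.
    // Production normalizers are regex-driven and often site-specific.
    static String normalize(String url) {
        return url.toLowerCase(Locale.ROOT)
                  .replaceFirst("^(https?://)www\\.", "$1")
                  .replaceFirst(":80/", "/")
                  .replaceFirst("/$", "");
    }

    public static void main(String[] args) {
        // www.x.com and x.com collapse to the same canonical form
        System.out.println(normalize("HTTP://WWW.X.com:80/index/")); // http://x.com/index
        System.out.println(normalize("http://x.com/index"));         // http://x.com/index
    }
}
```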
+ === AJAX-ish Page Interaction ===
+ 
+  * Not applicable for broad/general crawling
+  * Can be very important for specific web sites
+  * Use Selenium or headless Mozilla
+ 
+ === Component API Issues ===
+ 
+  * Want to avoid using an API that's tied too closely to any implementation.
+  * One option is to have simple (e.g. URL param) API that takes meta-data.
+   * Similar to Tika passing in of meta-data.
+ 
+ === Hosting Options ===
+ 
+  * As part of Nutch - but easy to get lost in Nutch codebase, and can be 
associated too closely with Nutch.
+  * As part of Droids - but Droids is both a framework (queue-based) and set 
of components.
+  * New sub-project under Lucene TLP - but overhead to set up/maintain, and 
then confusion between it and Droids.
+  * Google code - seems like a good short-term solution, to judge level of 
interest and help shake out issues.
+ 
+ == Next Steps ==
+ 
+  * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally 
he'd add his comments to this page.
+  * Get input from Thorsten on Google code option. If OK as starting point, 
then Andrzej to set up.
+  * Make decision about build system (and then move on to code formatting 
debate :))
+   * I'm going to propose ant + maven ant tasks for dependency management. I'm 
using this with Bixo, and so far it's been pretty good.
+  * Start contributing code
+   * Ken will put in robots.txt parser.
+ 
+ == Original Discussion Topic List ==
  
  Below are some potential topics for discussion - feel free to add/comment.
  


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-11-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=6&rev2=7

--

  == Attendees ==
  
   * Andrzej Bialeki - Apache Nutch
-  * Thorsten xxx - Apache Droids
+  * Thorsten Sherler - Apache Droids
   * Michael Stack - Formerly with Heritrix, now HBase
   * Ken Krugler - Bixo
  


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-11-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=7&rev2=8

--

  We had a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
  
  It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 
11am - 1pm. 
+ 
+ -
  
  == Attendees ==
  
@@ -15, +17 @@

  
  === Roadmaps ===
  
- Nutch - become more component based.
+  * Nutch - become more component based.
- Droids - get more people involved.
+  * Droids - get more people involved.
  
  === Sharable Components ===
  
@@ -76, +78 @@

   * Start contributing code
* Ken will put in robots.txt parser.
  
+ -
+ 
  == Original Discussion Topic List ==
  
  Below are some potential topics for discussion - feel free to add/comment.


[Nutch Wiki] Update of ApacheConUs2009MeetUp by AndrzejBialecki

2009-11-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The ApacheConUs2009MeetUp page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=8&rev2=9

--

  
  == Attendees ==
  
-  * Andrzej Bialeki - Apache Nutch
+  * Andrzej Bialecki - Apache Nutch
   * Thorsten Sherler - Apache Droids
   * Michael Stack - Formerly with Heritrix, now HBase
   * Ken Krugler - Bixo


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-10-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The ApacheConUs2009MeetUp page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=4&rev2=5

--

- We're planning to have a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
+ We were planning to have a Web Crawler Developer !MeetUp at this year's 
[[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
  
- Tentative plan is for Thursday evening, November 5th. The actual schedule for 
!MeetUps is [[http://wiki.apache.org/apachecon/ApacheMeetupsUs09|here]].
+ Unfortunately the only time slot where people would be around was Thursday 
night, which wound up conflicting with the Hadoop !MeetUp.
+ 
+ So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th 
from 11am - 1pm. Location is TBD, hopefully we can get some space at the event 
but might be a lunch meeting :)
  
  Below are some potential topics for discussion - feel free to add/comment.
  


[Nutch Wiki] Update of DownloadingNutch by SteveKearns

2009-10-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The DownloadingNutch page has been changed by SteveKearns.
http://wiki.apache.org/nutch/DownloadingNutch?action=diff&rev1=5&rev2=6

--

  You have two choices in how to get Nutch:
-   1. You can download a release from http://lucene.apache.org/nutch/release/. 
 This will give you a relatively stable release.  At the moment the latest 
release is 0.9.
+   1. You can download a release from http://lucene.apache.org/nutch/release/. 
 This will give you a relatively stable release.  At the moment the latest 
release is 1.0.
-   2. Or, you can check out the latest source code from subversion and build 
it with Ant.  This gets you closer to the bleeding edge of development.  The 
0.9 should be relatively stable but the trunk (from which the 
[[http://lucene.apache.org/nutch/nightly.html|nightly builds]] are build) is 
under heavy development with bugs showing up and getting squashed fairly 
frequently. 
+   2. Or, you can check out the latest source code from subversion and build 
it with Ant.  This gets you closer to the bleeding edge of development.  The 
1.0 release should be relatively stable but the trunk (from which the 
[[http://lucene.apache.org/nutch/nightly.html|nightly builds]] are built) is 
under heavy development with bugs showing up and getting squashed fairly 
frequently. 
  
  Note: As of 5/29/08 the Subversion trunk seems to be much better than the 0.9 
release. If you have trouble with 0.9 your best bet is to try moving to trunk 
and see if the problems resolve themselves.
  


[Nutch Wiki] Trivial Update of 首页 by yongping8204

2009-10-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The 首页 page has been changed by yongping8204.
http://wiki.apache.org/nutch/%E9%A6%96%E9%A1%B5?action=diff&rev1=4&rev2=5

--

  #format wiki
  #language zh
  #pragma section-numbers off
- 
  = 维基链接名 维基 =
  You might start from these links:
  
-  * [[最新改动]]: who recently changed what
+  * [[最新改动]]: who recently changed what (I am editing)
   * [[维基沙盘演练]]: feel free to edit and experiment here, as a warm-up
   * [[查找网页]]: search and browse this site in multiple ways
-  * [[语法参考]]: a handy reference for wiki syntax 
+  * [[语法参考]]: a handy reference for wiki syntax
   * [[站点导航]]: an overview of this site's content
+ 
  What is this wiki about?
  
  Test
  
+ == How to Use This Site ==
+ A wiki is a collaborative web site: anyone can take part in building, editing, 
and maintaining it, and share its content:
  
- == How to Use This Site ==
- 
- A wiki is a collaborative web site: anyone can take part in building, editing, 
and maintaining it, and share its content:
   * Click '''GetText(Edit)''' in the header or footer of any page to edit it freely.
   * Creating a link could not be simpler: use run-together capitalized words 
with no spaces (e.g. WikiSandBox), or {{{[quoted words in brackets]}}}. 
Simplified Chinese links should use the latter, e.g. {{{[维基沙盘演练]}}}.
   * The search box in each page's header can be used for title search or 
full-text search.


[Nutch Wiki] Update of Support by KelvinTan

2009-09-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KelvinTan:
http://wiki.apache.org/nutch/Support

--
* Sudhi Seshachala sudhi_...@yahoo.com Please visit 
http://www.myopensourcejobs.com (Built on LAMP and Nutch)
* http://www.termindoc.de (SP data GmbH, Germany schackenberg at 
termindoc.de)
* [http://www.mint.nl/ MINT] (Media Integration) info at mint.nl
-   * [http://www.supermind.org/ Kelvin Tan] kelvint at apache.org
+   * [http://www.supermind.org/ Kelvin Tan] Kelvin Tan - Lucene, Solr and 
Nutch consulting. Specializes in vertical search.
* [http://www.tokenizer.org/ Tokenizer Inc.] Fuad Efendi, director, 
[1](416)993-2060, first_name at last_name.ca. Toronto, Canada.
* [http://www.webdev2b.com Vladimir Brezhnev] rsdsoft at gmail.com
* [http://www.wyona.com/ Wyona] open source software development, contact 
at wyona.com 


[Nutch Wiki] Update of Support by Justin Gilbreath

2009-09-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Justin Gilbreath:
http://wiki.apache.org/nutch/Support

--
  
  Entries are listed alphabetically by company or last name.
  
-   * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, 
support, and value-add components (i.e. spiders, UI, security) for Nutch, 
Lucene and Solr.  Based in Germany with customers across Europe and North 
America. contact at 30digits.com
+   * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, 
support, and value-add components (i.e. spiders, UI, security) for Nutch, 
Lucene and Solr.  Based in Germany (Deutschland) with customers across Europe 
and North America. contact at 30digits.com
* [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
* CNLP  http://www.cnlp.org/tech/lucene.asp
* [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at 
digitalpebble.com. Norwich, UK.


[Nutch Wiki] Update of PublicServers by ReinierBattenberg

2009-08-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by ReinierBattenberg:
http://wiki.apache.org/nutch/PublicServers

--
  
* [http://www.misterbot.fr Misterbot.fr] a search engine for french 
language web sites.
   
+   * [http://search.mountbatten.net Mountbatten Search] a search engine that 
crawls only the part of the Internet located in Uganda.
+ 
* [http://www.mozdex.com mozDex].com Running Nutch SVN release with 
Clustering  Ontology support enabled.
  
* [http://www.myopensourcejobs.com MyOpensourcejobs] An Opensource skills 
jobs site using NUTCH and a LAMP-based DRUPAL CMS.


[Nutch Wiki] Trivial Update of FrontPage by KenKrugler

2009-08-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/FrontPage

--
   * [Getting Started]
   * JavaDemoApplication - A simple demonstration of how to use the Nutch API 
in a Java application
   * InstallingWeb2
+  * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
  
  == Nutch 2.0 ==
   * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.


[Nutch Wiki] Trivial Update of FrontPage by KenKrugler

2009-08-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/FrontPage

--
   * [Getting Started]
   * JavaDemoApplication - A simple demonstration of how to use the Nutch API 
in a Java application
   * InstallingWeb2
-  * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
+  * [ApacheConUs2009MeetUp ApacheCon US 2009 MeetUp] - List of topics for 
!MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6)
  
  == Nutch 2.0 ==
   * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.


[Nutch Wiki] Trivial Update of FrontPage by KenKrugler

2009-08-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/FrontPage

--
   * [Getting Started]
   * JavaDemoApplication - A simple demonstration of how to use the Nutch API 
in a Java application
   * InstallingWeb2
-  * [ApacheConUs2009MeetUp ApacheCon US 2009 MeetUp] - List of topics for 
!MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6)
+  * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 
in Oakland (Nov 2-6)
  
  == Nutch 2.0 ==
   * [Nutch2Architecture] -- Discussions on the Nutch 2.0 architecture.


[Nutch Wiki] Update of ApacheConUs2009MeetUp by KenKrugler

2009-08-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

The comment on the change is:
List of potential discussion topics for ApacheCon US 2009 MeetUp

New page:
We're planning to have a Web Crawler Developer !MeetUp at this year's 
ApacheCon US in Oakland.

Tentative plan is for Thursday evening, November 5th. The actual schedule for 
!MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here].

Below are some potential topics for discussion - feel free to add/comment.

* Potential synergies between crawler projects - e.g. sharing robots.txt 
processing code.
* How to avoid end-user abuse - webmasters sometimes block crawlers because 
users configure it to be impolite.
* Politeness vs. efficiency - various options for how to be considered polite, 
while still crawling quickly.
* robots.txt processing - current problems with existing implementations
* Avoiding crawler traps - link farms, honeypots, etc.
* Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
* Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
* Testing challenges - is it possible to unit test a crawler?
* Fuzzy classification - mime-type, charset, language.
* The future of Nutch, Droids, Heritrix, Bixo, etc.
* Optimizing for types of crawling - intranet, focused, whole web.


[Nutch Wiki] Trivial Update of ApacheConUs2009MeetUp by KenKrugler

2009-08-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

--
  
  Below are some potential topics for discussion - feel free to add/comment.
  
- * Potential synergies between crawler projects - e.g. sharing robots.txt 
processing code.
+  * Potential synergies between crawler projects - e.g. sharing robots.txt 
processing code.
- * How to avoid end-user abuse - webmasters sometimes block crawlers because 
users configure it to be impolite.
+  * How to avoid end-user abuse - webmasters sometimes block crawlers because 
users configure it to be impolite.
- * Politeness vs. efficiency - various options for how to be considered 
polite, while still crawling quickly.
+  * Politeness vs. efficiency - various options for how to be considered 
polite, while still crawling quickly.
- * robots.txt processing - current problems with existing implementations
+  * robots.txt processing - current problems with existing implementations
- * Avoiding crawler traps - link farms, honeypots, etc.
+  * Avoiding crawler traps - link farms, honeypots, etc.
- * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
+  * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
- * Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
+  * Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
- * Testing challenges - is it possible to unit test a crawler?
+  * Testing challenges - is it possible to unit test a crawler?
- * Fuzzy classification - mime-type, charset, language.
+  * Fuzzy classification - mime-type, charset, language.
- * The future of Nutch, Droids, Heritrix, Bixo, etc.
+  * The future of Nutch, Droids, Heritrix, Bixo, etc.
- * Optimizing for types of crawling - intranet, focused, whole web.
+  * Optimizing for types of crawling - intranet, focused, whole web.
  


[Nutch Wiki] Trivial Update of ApacheConUs2009MeetUp by KenKrugler

2009-08-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

--
- We're planning to have a Web Crawler Developer !MeetUp at this year's 
ApacheCon US in Oakland.
+ We're planning to have a Web Crawler Developer !MeetUp at this year's 
[http://www.us.apachecon.com/c/acus2009/ ApacheCon US] in Oakland.
  
  Tentative plan is for Thursday evening, November 5th. The actual schedule for 
!MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here].
  
@@ -11, +11 @@

   * Politeness vs. efficiency - various options for how to be considered 
polite, while still crawling quickly.
   * robots.txt processing - current problems with existing implementations
   * Avoiding crawler traps - link farms, honeypots, etc.
-  * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
+  * Parsing content - home grown, Neko/!TagSoup, Tika, screen scraping
   * Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
   * Testing challenges - is it possible to unit test a crawler?
   * Fuzzy classification - mime-type, charset, language.


[Nutch Wiki] Update of PublicServers by stoicleo

2009-07-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by stoicleo:
http://wiki.apache.org/nutch/PublicServers

--
  Please sort by name alphabetically
  
* [http://askaboutoil.com AskAboutOil] is a vertical search portal for the 
petroleum industry.
+ 
+   * [http://www.asbestosinfo.info Asbestos] is a vertical search portal and 
discussion forum for the asbestos and related information.
  
* [http://www.baynote.com/go Baynote] provides free hosted Nutch search for 
businesses.
  


[Nutch Wiki] Update of FrontPage by AlexMc

2009-07-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by AlexMc:
http://wiki.apache.org/nutch/FrontPage

--
   * [:Automating_Fetches_with_Python:Automating Fetches with Python] - How to 
automate the Nutch fetching process using Python
   * [:Upgrading_Hadoop:Upgrading Hadoop Version in Nutch] - Basic steps for 
upgrading Hadoop in Nutch.
   * [FAQ]
-  * [:CommandLineOptions:Commandline] options for 0.7.x
+  * [:07CommandLineOptions:Commandline] options for 0.7.x
   * [:08CommandLineOptions:Commandline] options for version 0.8
+  * Current CommandLineOptions
   * OverviewDeploymentConfigs
   * NutchConfigurationFiles
   * GettingNutchRunningWithUtf8 - For support of non-ASCII characters 
(Chinese, German, Japanese, Korean).


[Nutch Wiki] Update of bin/nutch readdb by AlexMc

2009-07-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by AlexMc:
http://wiki.apache.org/nutch/bin/nutch_readdb

--
  
  CommandLineOptions
  
+ 
+ (Actually this looks out of date. You might be looking for 
org.apache.nutch.crawl.CrawlDBReader instead)
+ 


[Nutch Wiki] Trivial Update of bin/nutch readdb by AlexMc

2009-07-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by AlexMc:
http://wiki.apache.org/nutch/bin/nutch_readdb

--
  CommandLineOptions
  
  
- (Actually this looks out of date. You might be looking for 
org.apache.nutch.crawl.CrawlDBReader instead)
+ (Actually this looks out of date. You might be looking for 
org.apache.nutch.crawl.CrawlDbReader instead)
  


[Nutch Wiki] Update of FrontPage by DanielZhou

2009-07-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by DanielZhou:
http://wiki.apache.org/nutch/FrontPage

--
   * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself 
using NTLM, Basic or Digest authentication schemes.
   * NonDefaultIntranetCrawlingOptions - Desirable options to add to your 
intranet crawling configuration.
   * RunningNutchAndSolr - How to configure Nutch to crawl, but post to Solr 
for search/index
+  * NutchWithChineseAnalyzer - References to some Chinese articles explaining 
how to setup Nutch with 3rd party Chinese analyzers
  
  == Nutch Development ==
   * [:Becoming_A_Nutch_Developer:Becoming a Nutch Developer] - Start 
developing and contributing to Nutch.


[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson

2009-06-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

New page:
===Adding a New Language to Nutch===

If you want to have Nutch in your language - hopefully the below helps.  I just 
Googled around.

* Unzip Nutch 1.0 to any folder

* Translate the .properties files that you find in src/web/locale/org/nutch/jsp 
:
** For each file make sure that you have your own version ending in 
_langcode.properties e.g. _fa.properties .  Btw OmegaT is an excellent 
Translation memory program to help with standardizing terms etc.

* Make a folder src/web/include/langcode with a file header.xml - again this 
needs to be translated.
* Make a folder src/web/pages/langcode and copy the .xml files from the 
English folder and then translate them.  In search.xml look for the line:
<pre>
<input type="hidden" name="lang" value="fa"/>
</pre>
Change the value of lang to match the language you are adding (e.g. fa)

* Add your language to src/web/include/footer.html

* In the Nutch base directory run ant

<pre>
ant generate-docs
</pre>

* Work in progress - I now find that when doing the search it still comes back 
in English... for some reason it seems like the JSP loads the resource bundle 
according to the language passed by the browser headers, not according to the 
lang parameter...


[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson

2009-06-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

--
  
  If you want to have Nutch in your language - hopefully the below helps.  I 
just Googled around.
  
- * Unzip Nutch 1.0 to any folder
+  * Unzip Nutch 1.0 to any folder
  
- * Translate the .properties files that you find in 
src/web/locale/org/nutch/jsp :
+  * Translate the .properties files that you find in 
src/web/locale/org/nutch/jsp :
- ** For each file make sure that you have your own version ending in 
_langcode.properties e.g. _fa.properties .  Btw OmegaT is an excellent 
Translation memory program to help with standardizing terms etc.
+  * For each file make sure that you have your own version ending in 
_langcode.properties e.g. _fa.properties .  Btw OmegaT is an excellent 
Translation memory program to help with standardizing terms etc.
  
- * Make a folder src/web/include/langcode with a file header.xml - again 
this needs translated.
+  * Make a folder src/web/include/langcode with a file header.xml - again 
this needs to be translated.
- * Make a folder src/web/pages/langcode and copy the .xml files from the 
English folder and then translate them.  In search.xml look for the line:
+  * Make a folder src/web/pages/langcode and copy the .xml files from the 
English folder and then translate them.  In search.xml look for the line:
- <pre>
+ 
+ {{{
  <input type="hidden" name="lang" value="fa"/>
- </pre>
+ }}}
  Change the value of lang to match the language you are adding (e.g. fa)
  
- * Add your language to src/web/include/footer.html
+  * Add your language to src/web/include/footer.html
  
- * In the Nutch base directory run ant
+  * In the Nutch base directory run ant
  
- <pre>
+ {{{
  ant generate-docs
- </pre>
+ }}}
  
- * Work in progress - I now find that when doing the search it still comes 
back in English... for some reason it seems like the JSP loads the resource 
bundle according to the language passed by the browser headers, not according 
to the lang parameter...
+  * Work in progress - I now find that when doing the search it still comes 
back in English... for some reason it seems like the JSP loads the resource 
bundle according to the language passed by the browser headers, not according 
to the lang parameter...
  


[Nutch Wiki] Update of AddingNewLocalization by Mike Dawson

2009-06-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Mike Dawson:
http://wiki.apache.org/nutch/AddingNewLocalization

--
- ===Adding a New Language to Nutch===
+ = Adding a New Language to Nutch =
  
- If you want to have Nutch in your language - hopefully the below helps.  I 
just Googled around.
+ If you want to have Nutch in your language - hopefully the below helps.  I 
have been Googling around and digging in some source code...
  
   * Unzip Nutch 1.0 to any folder
  
@@ -25, +25 @@

  ant generate-docs
  }}}
  
-  * Work in progress - I now find that when doing the search it still comes 
back in English... for some reason it seems like the JSP loads the resource 
bundle according to the language passed by the browser headers, not according 
to the lang parameter...
+  * It seems like some changes are needed to search.jsp to make it behave as 
users would expect.  The original appears to expect the language of the browser 
to take precedence over the language selected...  After out.flush() at about 
line 160 add the following in src/web/jsp/search.jsp:
  
+ {{{
+ 
+   // see what locale we should use
+   Locale ourLocale = null;
+   if (!queryLang.equals("")) {
+     ourLocale = new Locale(queryLang);
+     language = queryLang;
+   } else {
+     ourLocale = request.getLocale();
+   }
+ 
+ }}}
+ 
+ Then change the line:
+ 
+ {{{
+ <i18n:bundle baseName="org.nutch.jsp.search"/>
+ }}}
+ 
+ to:
+ 
+ {{{
+ <i18n:bundle baseName="org.nutch.jsp.search" locale="<%=ourLocale%>"/>
+ }}}
+ 
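The fallback implemented above — explicit lang parameter first, browser-negotiated locale second — can be sketched in plain Java (class and method names here are hypothetical, for illustration only):

```java
import java.util.Locale;

public class LocalePick {
    // Prefer an explicit ?lang= query parameter; otherwise fall back to
    // the locale negotiated from the request (browser Accept-Language).
    static Locale pick(String queryLang, Locale requestLocale) {
        if (queryLang == null || queryLang.equals("")) {
            return requestLocale;
        }
        return new Locale(queryLang);
    }

    public static void main(String[] args) {
        System.out.println(pick("fa", Locale.US)); // fa
        System.out.println(pick("", Locale.US));   // en_US
    }
}
```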
+ * Now we are ready to build it:
+ 
+ {{{
+ ant war
+ }}}
+ 
+ * Copy the .war file to your servlet container's webapp directory.  If 
everything went well you will see your language code at the bottom; you can 
then select it, and the search interface will come back with the localisation 
you just put in.
+ 


[Nutch Wiki] Update of Support by Justin Gilbreath

2009-06-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Justin Gilbreath:
http://wiki.apache.org/nutch/Support

--
  
  Entries are listed alphabetically by company or last name.
  
+   * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, 
support, and value-add components (i.e. spiders, UI, security) for Nutch, 
Lucene and Solr.  Based in Germany with customers across Europe and North 
America.
* [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
* CNLP  http://www.cnlp.org/tech/lucene.asp
* [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at 
digitalpebble.com. Norwich, UK.


[Nutch Wiki] Update of Support by Justin Gilbreath

2009-06-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Justin Gilbreath:
http://wiki.apache.org/nutch/Support

--
  
  Entries are listed alphabetically by company or last name.
  
-   * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, 
support, and value-add components (i.e. spiders, UI, security) for Nutch, 
Lucene and Solr.  Based in Germany with customers across Europe and North 
America.
+   * [http://www.30digits.com/ 30 Digits] - Implementation, consulting, 
support, and value-add components (e.g. spiders, UI, security) for Nutch, 
Lucene and Solr.  Based in Germany with customers across Europe and North 
America. contact at 30digits.com
* [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
* CNLP  http://www.cnlp.org/tech/lucene.asp
* [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at 
digitalpebble.com. Norwich, UK.


[Nutch Wiki] Update of HttpAuthenticationSchemes by wobbet

2009-06-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by wobbet:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

--
  
  == Configuration ==
  Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in a 
little more detail. In all the examples below, the root element 
'auth-configuration' has been omitted for the sake of clarity.
+ 
+ === Prerequisites ===
+ In order to use HTTP Authentication, your Nutch install must be configured to 
use 'protocol-httpclient' instead of the default 'protocol-http'. To make this 
change, copy the 'plugin.includes' property from 'conf/nutch-default.xml' and 
paste it into 'conf/nutch-site.xml'. Within that property, replace 
'protocol-http' with 'protocol-httpclient'. If you have made no other changes, 
it will look as follows:
+ {{{
+ <property>
+   <name>plugin.includes</name>
+   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+   <description>Regular expression naming plugin directory names to
+   include.  Any plugin not matching this expression is excluded.
+   In any case you need at least include the nutch-extensionpoints plugin. By
+   default Nutch includes crawling just HTML and plain text via HTTP,
+   and basic indexing and search plugins. In order to use HTTPS please enable 
+   protocol-httpclient, but be aware of possible intermittent problems with 
the 
+   underlying commons-httpclient library.
+   </description>
+ </property>
+ }}}
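One pitfall when making this edit: 'protocol-http' is a prefix of 'protocol-httpclient', so a naive text replace can corrupt a value that already contains the latter. Splitting on '|' and swapping the exact token is safer, as this small illustrative sketch (not part of Nutch) shows:

```python
def swap_protocol_plugin(plugin_includes):
    """Replace the exact 'protocol-http' token with 'protocol-httpclient'.

    A plain string replace would also mangle an existing
    'protocol-httpclient' token, since 'protocol-http' is its prefix.
    """
    tokens = plugin_includes.split("|")
    tokens = ["protocol-httpclient" if t == "protocol-http" else t
              for t in tokens]
    return "|".join(tokens)
```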
+ 
+ === Optional ===
+ By default Nutch uses credentials from 'httpclient-auth.xml'. If you wish to 
use a different file, you will need to copy the 'http.auth.file' property from 
'conf/nutch-default.xml', paste it into 'conf/nutch-site.xml', and then 
modify the 'value' element. The default property appears as follows:
+ {{{
+ <property>
+   <name>http.auth.file</name>
+   <value>httpclient-auth.xml</value>
+   <description>Authentication configuration file for 'protocol-httpclient' 
+   plugin.</description>
+ </property>
+ }}}
+ 
  
  === Crawling an Intranet with Default Authentication Scope ===
  If all pages of an intranet are protected by basic, digest or NTLM 
authentication and there is only one set of credentials to be used for all web 
pages in the intranet, then a configuration as described below is enough. This 
is also the simplest possible configuration for authentication schemes.


[Nutch Wiki] Update of HttpAuthenticationSchemes by susam

2009-06-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
Added TableOfContents and minor edits in Prerequisites and Optional sect

--
+ [[TableOfContents]]
+ 
  == Introduction ==
  This is a feature in Nutch that allows the crawler to authenticate itself to 
websites requiring NTLM, Basic or Digest authentication. This feature cannot 
perform POST-based authentication that depends on cookies. More information on 
this can be found at: HttpPostAuthentication
  
@@ -18, +20 @@

  Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in a 
little more detail. In all the examples below, the root element 
'auth-configuration' has been omitted for the sake of clarity.
  
  === Prerequisites ===
- In order use HTTP Authentication your Nutch install must be configured to use 
'protocol-httpclient' instead of the default 'protocol-http'. To make this 
change copy the 'plugin.includes' property from 'conf/nutch-default.xml' and 
paste it into 'conf/nutch-site.xml'. Within that property replace 
'protocol-http' with 'protocol-httpclient'. If you have made no other changes 
it will look as follows:
+ In order to use HTTP Authentication, the Nutch crawler must be configured to 
use 'protocol-httpclient' instead of the default 'protocol-http'. To do this 
copy 'plugin.includes' property from 'conf/nutch-default.xml' into 
'conf/nutch-site.xml'. Replace 'protocol-http' with 'protocol-httpclient' in 
the value of the property. If you have made no other changes it should look as 
follows:
  {{{
  <property>
    <name>plugin.includes</name>
@@ -35, +37 @@

  }}}
  
  === Optional ===
- By default Nutch use credential from 'httpclient-auth.xml'. If you wish to 
use a different file you will need to copy the 'http.auth.file' property from 
'conf/nutch-default.xml' and paste it into 'conf/nutch-site.xml' and then 
modify the 'value' element. The default property appears as follows:
+ By default Nutch uses credentials from 'conf/httpclient-auth.xml'. If you 
wish to use a different file, the file should be placed in the 'conf' directory 
and 'http.auth.file' property should be copied from 'conf/nutch-default.xml' 
into 'conf/nutch-site.xml' and then the file name in the 'value' element 
should be edited accordingly. The default property appears as follows:
  {{{
  <property>
    <name>http.auth.file</name>
@@ -43, +45 @@

  <description>Authentication configuration file for 'protocol-httpclient' 
plugin.</description>
  </property>
  }}}
- 
  
  === Crawling an Intranet with Default Authentication Scope ===
  If all pages of an intranet are protected by basic, digest or NTLM 
authentication and there is only one set of credentials to be used for all web 
pages in the intranet, then a configuration as described below is enough. This 
is also the simplest possible configuration for authentication schemes.


[Nutch Wiki] Update of IntranetRecrawl by susam

2009-06-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/IntranetRecrawl

--
  echo  frequent depending on disk constraints) and a new crawl generated.
  }}}
  
+ == Version 1.0 ==
+ A crawl script that runs properly with bash and has been tested with Nutch 
1.0 can be found here: Crawl
+ 


[Nutch Wiki] Update of IntranetRecrawl by susam

2009-06-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/IntranetRecrawl

The comment on the change is:
Link to crawl script for Nutch 1.0

--
  }}}
  
  == Version 1.0 ==
- A crawl script that runs properly with bash and has been tested with Nutch 
1.0 can be found here: Crawl
+ A crawl script that runs properly with bash and has been tested with Nutch 
1.0 can be found here: Self:Crawl. This script can perform both a fresh crawl 
and a recrawl. However, it has not seen much real-world recrawl use, so it 
might require a little tweaking if it does not suit your needs.
  


[Nutch Wiki] Update of Support by JulienNioche

2009-06-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by JulienNioche:
http://wiki.apache.org/nutch/Support

--
  
* [http://www.sigram.com Andrzej Bialecki] ab at sigram.com
* CNLP  http://www.cnlp.org/tech/lucene.asp
+   * [http://www.digitalpebble.com/ DigitalPebble Ltd.] contact at 
digitalpebble.com. Norwich, UK.
* [http://www.doculibre.com/ Doculibre Inc.] Open source and information 
management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) info at 
doculibre.com
* [http://www.dsen.nl Thomas Delnoij (DSEN) - Java | J2EE | Agile 
Development  Consultancy]
* eventax GmbH info at eventax.com


[Nutch Wiki] Update of FrontPage by JohnWhelan

2009-06-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by JohnWhelan:
http://wiki.apache.org/nutch/FrontPage

The comment on the change is:
Adding 'WhelanLabs SearchEngine Manager' under 'other resources'.

--
   * [http://blog.foofactory.fi/ FooFactory] Nutch and Hadoop related posts
   * [http://spinn3r.com Spinn3r] [http://spinn3r.com/opensource.php Open 
Source components] (our contribution to the crawling OSS community with more to 
come).
   * [http://www.interadvertising.co.uk/blog/nutch_logos Larger / better 
quality Nutch logos] Re-created Nutch logos available in GIF, PNG  EPS in 
resolutions up to 1200 x 449
+  * [http://www.whelanlabs.com/content/SearchEngineManager.htm WhelanLabs 
SearchEngine Manager] An all-in-one, bundled implementation of Nutch, Tomcat, 
and Cygwin, and JRE for Microsoft Windows. Includes an installer and a 
simplified administrative UI.
  


[Nutch Wiki] Update of GettingNutchRunningWithWindows by JohnWhelan

2009-06-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by JohnWhelan:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

--
  Since Nutch is written in Java, it is possible to get Nutch working in a 
Windows environment, provided that the correct software is installed.
+ 
+ Note: If you're just interested in a basic installation on Windows and are 
not interested in knowing the details of how it is done, you might want to 
check whether the [http://www.whelanlabs.com/content/SearchEngineManager.htm 
WhelanLabs SearchEngine Manager] fits your needs. It is a free installer for 
Nutch on Windows.
  
  The following documents describe how I got it working on Windows XP Pro 
running Tomcat 5.28.  Edit: page updated with my experience installing on 
Windows Server 2003.
  


[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

--
  == Introduction ==
- This is a feature in Nutch, developed by Susam Pal, that allows the crawler 
to authenticate itself to websites requiring NTLM, Basic or Digest 
authentication. This feature can not do POST based authentication that depends 
on cookies. More information on this can be found at: HttpPostAuthentication
+ This is a feature in Nutch that allows the crawler to authenticate itself to 
websites requiring NTLM, Basic or Digest authentication. This feature cannot 
perform POST-based authentication that depends on cookies. More information on 
this can be found at: HttpPostAuthentication
  
  == Necessity ==
  There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were re-written to solve these problems and to provide 
additional features such as authentication support for proxy servers and 
better inline documentation for the properties used to configure 
authentication.
@@ -108, +108 @@

  Once you have checked the items listed above and are still unable to fix the 
problem, or are confused about any point listed above, please mail the issue 
with the following information:
  
   1. Version of Nutch you are running.
-  1. Complete code in ''conf/httpclient-auth.xml' file.
+  1. Complete code in 'conf/httpclient-auth.xml' file.
   1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send 
the complete file.
  


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
  -  private static class LuceneDocumentWrapper implements Writable {
  +  public static class LuceneDocumentWrapper implements Writable { ).
  
+ 
+ Hi, I too faced problems integrating Solr and Nutch. After some work I found 
the article below and integrated them successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+ 


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
  +  public static class LuceneDocumentWrapper implements Writable { ).
  
  
- HI, I to faced problems to integrate solr and nutch.After , some work i found 
the below article and integrated successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
  
+ Hi, I too faced problems integrating Solr and Nutch. After some work I found 
the article below and integrated them successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+ 


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
  d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste 
following fragment to it
  
  <requestHandler name="/nutch" class="solr.SearchHandler">
+ 
  <lst name="defaults">
+ 
  <str name="defType">dismax</str>
+ 
  <str name="echoParams">explicit</str>
+ 
  <float name="tie">0.01</float>
+ 
  <str name="qf">
+ 
  content^0.5 anchor^1.0 title^1.2
  </str>
+ 
  <str name="pf">
  content^0.5 anchor^1.5 title^1.2 site^1.5
  </str>
+ 
  <str name="fl">
  url
  </str>
+ 
  <str name="mm">
  2&lt;-1 5&lt;-2 6&lt;90%
  </str>
+ 
  <int name="ps">100</int>
+ 
  <bool hl="true"/>
+ 
  <str name="q.alt">*:*</str>
+ 
  <str name="hl.fl">title url content</str>
+ 
  <str name="f.title.hl.fragsize">0</str>
+ 
  <str name="f.title.hl.alternateField">title</str>
+ 
  <str name="f.url.hl.fragsize">0</str>
+ 
  <str name="f.url.hl.alternateField">url</str>
+ 
  <str name="f.content.hl.fragmenter">regex</str>
+ 
  </lst>
+ 
  </requestHandler>
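The `mm` (minimum-should-match) value above, `2<-1 5<-2 6<90%`, is a list of conditional clauses: for queries with more than 2 optional clauses all but 1 must match, with more than 5 all but 2, and with more than 6 at least 90%. A rough Python sketch of how such a spec is interpreted (an approximation of Solr's dismax behaviour, not its actual code):

```python
def min_should_match(n_clauses, spec="2<-1 5<-2 6<90%"):
    """Approximate Solr dismax 'mm' evaluation: each 'limit<value'
    condition applies when the query has more than `limit` optional
    clauses, and the last applicable condition wins.  A negative
    value means "all but that many"; a percentage rounds down."""
    required = n_clauses
    for cond in spec.split():
        limit, value = cond.split("<")
        if n_clauses > int(limit):
            if value.endswith("%"):
                required = n_clauses * int(value[:-1]) // 100
            else:
                required = n_clauses + int(value)  # value is negative
    return max(required, 1)
```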
  
  6. Start Solr


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
-  Setup
  
  The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
  
- 1. Download Solr version 1.3.0 or LucidWorks for Solr from Download page
+ '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
  
- 2. Extract Solr package
+ '''2.''' Extract Solr package
  
- 3. Download Nutch version 1.0 or later (Alternatively download the the 
nightly version of Nutch that contains the required functionality)
+ '''3.''' Download Nutch version 1.0 or later (alternatively, download the 
nightly version of Nutch that contains the required functionality)
  
- 4. Extract the Nutch package
+ '''4.''' Extract the Nutch package   tar xzf apache-nutch-1.0.tar.gz
  
- tar xzf apache-nutch-1.0.tar.gz
- 
- 5. Configure Solr
+ '''5.''' Configure Solr
- 
  For the sake of simplicity we are going to use the example
  configuration of Solr as a base.
  
- a. Copy the provided Nutch schema from directory
+ '''a.''' Copy the provided Nutch schema from directory
  apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf 
(override the existing file)
  
  We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
  
- b. Change schema.xml so that the stored attribute of field “content” is 
true.
+ '''b.''' Change schema.xml so that the stored attribute of field 
"content" is true.
  
  <field name="content" type="text" stored="true" indexed="true"/>
  
  We want to be able to tweak the relevancy of queries easily, so we'll create 
a new dismax request handler configuration for our use case:
  
- d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste 
following fragment to it
+ '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste 
following fragment to it
  
  <requestHandler name="/nutch" class="solr.SearchHandler">
  
@@ -93, +89 @@

  
  </requestHandler>
  
- 6. Start Solr
+ '''6.''' Start Solr
  
  cd apache-solr-1.3.0/example
  java -jar start.jar
  
- 7. Configure Nutch
+ '''7.''' Configure Nutch
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its 
contents with the following (we specify our crawler name, the active plugins, 
and limit the maximum URL count for a single host per run to 100):
  
  <?xml version="1.0"?>
  <configuration>
+ 
  <property>
+ 
  <name>http.agent.name</name>
+ 
  <value>nutch-solr-integration</value>
+ 
  </property>
+ 
  <property>
  <name>generate.max.per.host</name>
+ 
  <value>100</value>
+ 
  </property>
+ 
  <property>
+ 
  <name>plugin.includes</name>
+ 
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ 
  </property>
+ 
  </configuration>
  
+ 
- b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,
+ '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, 
replace its content with the following:
- replace it’s content with following:
  
  -^(https|telnet|file|ftp|mailto):
   
@@ -135, +143 @@

  # deny anything else
  -.
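The filter file is evaluated top to bottom: the first rule whose regex matches the URL decides, with '+' accepting and '-' rejecting. A small Python sketch of that evaluation order (the lucidimagination.com accept rule is a hypothetical stand-in for the site-specific rule elided by the diff hunk above):

```python
import re

# Rule order matters: the first matching rule wins; '+' accepts, '-' rejects.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    # hypothetical site-specific accept rule for this walkthrough
    ("+", r"^http://([a-z0-9]*\.)*lucidimagination\.com/"),
    ("-", r"."),  # deny anything else
]

def url_allowed(url, rules=RULES):
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False
```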
  
- 8. Create a seed list (the initial urls to fetch)
+ '''8.''' Create a seed list (the initial urls to fetch)
  
  mkdir urls
  echo "http://www.lucidimagination.com/" > urls/seed.txt
  
- 9. Inject seed url(s) to nutch crawldb (execute in nutch directory)
+ '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
  
  bin/nutch inject crawl/crawldb urls
  
- 10. Generate fetch list, fetch and parse content
+ '''10.''' Generate fetch list, fetch and parse content
  
  bin/nutch generate crawl/crawldb crawl/segments
  
@@ -166, +174 @@

  
  Now a full fetch cycle is completed. Next you can repeat step 10 a couple 
more times to get some more content.
  
- 11. Create linkdb
+ '''11.''' Create linkdb
  
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  
- 12. Finally index all content from all segments to Solr
+ '''12.''' Finally index all content from all segments to Solr
  
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb 
crawl/segments/*
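Steps 8-12 above can be sketched as a driver loop; `run_nutch` here is a hypothetical stand-in for invoking `bin/nutch` with the given arguments, and the generate/fetch/updatedb round (partially elided by the diff hunk above) is assumed from the standard Nutch fetch cycle:

```python
def crawl_and_index(run_nutch, rounds=2):
    """Sketch of the fetch cycle above: inject seeds, run a few
    generate/fetch/update rounds, invert links, then push everything
    to Solr.  `run_nutch` abstracts a `bin/nutch <args>` invocation."""
    run_nutch("inject", "crawl/crawldb", "urls")
    for _ in range(rounds):
        run_nutch("generate", "crawl/crawldb", "crawl/segments")
        run_nutch("fetch", "crawl/segments/*")      # fetch newest segment
        run_nutch("updatedb", "crawl/crawldb", "crawl/segments/*")
    run_nutch("invertlinks", "crawl/linkdb", "-dir", "crawl/segments")
    run_nutch("solrindex", "http://127.0.0.1:8983/solr/",
              "crawl/crawldb", "crawl/linkdb", "crawl/segments/*")
```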
  


[Nutch Wiki] Update of FrontPage by PalashRay

2009-04-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by PalashRay:
http://wiki.apache.org/nutch/FrontPage

--
   * ErrorMessages -- What they mean and suggestions for getting rid of them.
   * SetupProxyForNutch - using Tinyproxy on Ubuntu
   * CreateNewFilter - for example to add a category metadata to your index and 
be able to search for it
+  * HowToMakeCustomSearch
   * UpgradeFrom07To08
   * [Upgrading_from_0.8.x_to_0.9]
   * RunNutchInEclipse for v0.8


[Nutch Wiki] Update of HowToMakeCustomSearch by PalashRay

2009-04-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by PalashRay:
http://wiki.apache.org/nutch/HowToMakeCustomSearch

New page:
[This is for Nutch 1.0]

How do you index your custom data and then search for the same using the Nutch 
web interface? Suppose we want to search for the author of the website by his 
email id. 

== Indexing the email id ==

Before we can search for our custom data, we need to index it. Nutch has a 
plugin architecture very similar to that of Eclipse. We can write our own 
plugin for indexing. Here is the source code:

{{{

package com.swayam.nutch.plugins.indexfilter;
 
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;
 
/**
 * @author paawak
 */
public class EmailIndexingFilter implements IndexingFilter {

    private static final Log LOG = LogFactory.getLog(EmailIndexingFilter.class);

    private static final String KEY_CREATOR_EMAIL = "email";

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // look up email of the author based on the url of the site
        String creatorEmail = EmailLookup.getCreatorEmail(url.toString());

        LOG.info("creatorEmail = " + creatorEmail);

        if (creatorEmail != null) {
            doc.add(KEY_CREATOR_EMAIL, creatorEmail);
        }

        return doc;
    }

    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions(KEY_CREATOR_EMAIL, LuceneWriter.STORE.YES,
                LuceneWriter.INDEX.TOKENIZED, conf);
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

}

}}}


Also, you need to create a ''plugin.xml'':


{{{
<plugin id="index-email" name="Email Indexing Filter" version="1.0.0"
  provider-name="swayam">
 
  <runtime>
    <library name="EmailIndexingFilterPlugin.jar">
      <export name="*" />
    </library>
  </runtime>
 
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
 
  <extension id="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter"
    name="Email Indexing Filter"
    point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="index-email"
      class="com.swayam.nutch.plugins.indexfilter.EmailIndexingFilter" />
  </extension>
 
</plugin>
}}}

This done, create a new folder in the ''$NUTCH_HOME/plugins'' and put the jar 
and the plugin.xml there.

Now we have to activate this plugin. To do this, we have to edit the 
''conf/nutch-site.xml''.

{{{

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|parse-(text|html)|index-(basic|email)|query-(basic|site|url)</value>
  <description>Regular expression naming plugin id names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

}}}
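The plugin.includes value is a regular expression that plugin ids must match, which is why 'index-email' has to be listed explicitly alongside 'index-basic'. A rough illustration of that matching (a simplification of Nutch's actual plugin loading, using Python's re module):

```python
import re

PLUGIN_INCLUDES = (r"nutch-extensionpoints|protocol-http|parse-(text|html)"
                   r"|index-(basic|email)|query-(basic|site|url)")

def plugin_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    # The id must match the expression in full, so a similarly named
    # plugin (e.g. a hypothetical 'index-emailer') stays excluded.
    return re.fullmatch(includes, plugin_id) is not None
```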


== Now, how do I search my indexed data? ==
=== Option 1 [cumbersome]: ===

Add my own query plugin:

{{{

package com.swayam.nutch.plugins.queryfilter;
 
import org.apache.nutch.searcher.FieldQueryFilter;
 
/**
 * @author paawak
 */
public class MyEmailQueryFilter extends FieldQueryFilter {

    public MyEmailQueryFilter() {
        super("email");
    }

}

}}}

Do not forget to edit the plugin.xml.


{{{

<plugin
   id="query-email"
   name="Email Query Filter"
   version="1.0.0"
   provider-name="swayam">
 
   <runtime>
      <library name="EmailQueryFilterPlugin.jar">
         <export name="*"/>
      </library>
   </runtime>
 
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
 
   <extension id="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter"
      name="Email Query Filter"
      point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="query-email"
         class="com.swayam.nutch.plugins.queryfilter.MyEmailQueryFilter">
         <parameter name="fields" value="email"/>
      </implementation>
 
   </extension>
 
</plugin>

}}}


This line is particularly important:

<parameter name="fields" value="email"/>

If you skip this line, you will never be able to see this in search results.

The only catch here is that you have to prepend the keyword 'email:' to the 
search key. For example, if you want to search for 'jsm...@mydomain.com', you 
have to search for 'email:jsm...@mydomain.com' or 'email:jsmith'.

[Nutch Wiki] Update of HowToMakeCustomSearch by PalashRay

2009-04-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by PalashRay:
http://wiki.apache.org/nutch/HowToMakeCustomSearch

--
  [This is for Nutch 1.0]
  
- How do you index your custom data and then search for the same using the 
Nutch web interface? Suppose we want to search for the author of the website by 
his email id. 
+ How do you index your custom data and then search for the same using the 
Nutch web interface? 
+ 
+ == Use Case ==
+ 
+ Suppose we want to search for the author of the website by his email id. 
  
  == Indexing the email id ==
  
@@ -184, +188 @@

  
  If you skip this line, you will never be able to see this in search results.
  
- The only catch here is you have to append the keyword email: to the search 
key. For example, if you want to search for jsm...@mydomain.com, you have to 
search for email:jsm...@mydomain.com or email:jsmith.
+ The only catch here is you have to prepend the keyword 'email:' to 
the search key. For example, if you want to search for 
'jsm...@mydomain.com', you have to search for 
'email:jsm...@mydomain.com' or 'email:jsmith'.
  
  There is an easier and more elegant way.
  
@@ -224, +228 @@

  
  }}}
  
- With this while looking for jsm...@mydomain.com, you can simply enter 
jsm...@mydomain.com or a part the name like jsmit.
+ With this, while looking for 'jsm...@mydomain.com', you can simply 
enter 'jsm...@mydomain.com' or part of the name like 'jsmit'.
  
  == Building a Nutch plugin ==
  


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
For Vista, give cygwin administrative privileges

--
  
  Install cygwin and set the PATH environment variable for it. You can set it 
from the Control Panel, System, Advanced Tab, Environment Variables and 
edit/add PATH.
  
- I have in PATH like:
+ Example PATH:
  {{{
  C:\Sun\SDK\bin;C:\cygwin\bin
  }}}
  If you run bash from the Windows command line (Start > Run... > cmd.exe) it 
should successfully run cygwin.
  
- If you are running Eclipse on Vista, you will likely need to 
[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/
 turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely 
complain that it cannot change a directory permission when you later run the 
crawler:
+ If you are running Eclipse on Vista, you will need to either give cygwin 
administrative privileges or 
[http://www.mydigitallife.info/2006/12/19/turn-off-or-disable-user-account-control-uac-in-windows-vista/
 turn off Vista's User Access Control (UAC)]. Otherwise Hadoop will likely 
complain that it cannot change a directory permission when you later run the 
crawler:
  {{{
  org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions 
of ... Permission denied
  }}}


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Removed install of whoami in Windows (cygwin's whoami is used)

--
  
  == Before you start ==
  
+ Setting up Nutch to run in Eclipse can be tricky, and most of the time it is 
much faster if you edit Nutch in Eclipse but run the scripts from the command 
line (my 2 cents). However, it's very useful to be able to debug Nutch in 
Eclipse. Sometimes examining the logs (logs/hadoop.log) is the quicker way to 
debug a problem.
- Setting up Nutch to run into Eclipse can be tricky, and most of the time you 
are much faster if you edit Nutch in Eclipse but run the scripts from the 
command line (my 2 cents).
- However, it's very useful to be able to debug Nutch in Eclipse. But again you 
might be quicker by looking at the logs (logs/hadoop.log)...
  
  
  == Steps ==
@@ -34, +33 @@

  
  C:\Sun\SDK\bin;C:\cygwin\bin
  
- If you run bash in Start-RUN-cmd.exe it should work. 
+ If you run bash in Start > Run... > cmd.exe it should work. 
  
- 
- Then you should install tools from Microsoft website (adding 'whoami' 
command).
- 
- Example for Windows XP and sp2
- 
- 
http://www.microsoft.com/downloads/details.aspx?FamilyId=49AE8576-9BB9-4126-9761-BA8011FABF38displaylang=en
- 
- 
- Then you can follow rest of these steps
  
  === Install Nutch ===
   * Grab a fresh release of Nutch 0.9 - 
http://lucene.apache.org/nutch/version_control.html


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Add link to official release

--
  
  
  === Install Nutch ===
-  * Grab a fresh release of Nutch 0.9 - 
http://lucene.apache.org/nutch/version_control.html
+  * Grab a [http://lucene.apache.org/nutch/version_control.html fresh release] 
of Nutch 1.0 or download and untar the [http://lucene.apache.org/nutch/release/ 
official 1.0 release]. 
-  * Do not build Nutch now. Make sure you have no .project and .classpath 
files in the Nutch directory
+  * Do not build Nutch yet. Make sure you have no .project and .classpath 
files in the Nutch directory
  
  
  === Create a new java project in Eclipse ===


[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by FrankMcCown:
http://wiki.apache.org/nutch/RunNutchInEclipse1%2e0

The comment on the change is:
Clarified some instructions and improved grammar

--
   * Do not build Nutch yet. Make sure you have no .project and .classpath 
files in the Nutch directory
  
  
- === Create a new java project in Eclipse ===
+ === Create a new Java Project in Eclipse ===
   * File  New  Project  Java project  click Next
   * Name the project (Nutch_Trunk for instance)
   * Select Create project from existing source and use the location where 
you downloaded Nutch
   * Click on Next, and wait while Eclipse is scanning the folders
-  * Add the folder conf to the classpath (third tab and then add class 
folder) 
+  * Add the folder conf to the classpath (click the Libraries tab, click 
Add Class Folder... button, and select conf from the list) 
-  * Go to Order and Export tab, find the entry for added conf folder and 
move it to the top. It's required to make eclipse take config 
(nutch-default.xml, nutch-final.xml, etc.) resources from our conf folder not 
anywhere else.
+  * Go to the "Order and Export" tab, find the entry for the added conf folder and 
move it to the top (by checking it and clicking the "Top" button). This is 
required so Eclipse will take config (nutch-default.xml, nutch-final.xml, etc.) 
resources from our conf folder and not from somewhere else.
-  * Eclipse should have guessed all the java files that must be added on your 
classpath. If it's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries 
-  * Set output dir to tmp_build, create it if necessary
+  * Eclipse should have guessed all the Java files that must be added to your 
classpath. If that's not the case, add src/java, src/test and all plugin 
src/java and src/test folders to your source folders. Also add all jars in 
lib and in the plugin lib folders to your libraries 
+  * Click the "Source" tab and set the default output folder to 
Nutch_Trunk/bin/tmp_build. (You may need to create the tmp_build folder.)
+  * Click the "Finish" button
   * DO NOT add build to classpath
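Whether the conf-folder ordering above took effect can be sanity-checked with a small Java sketch (illustrative only; it simply asks the classloader where it finds a resource such as nutch-default.xml from the Nutch conf folder):

```java
// Sketch: report where a named resource is loaded from on the classpath.
// If the conf folder is first on the build path, nutch-default.xml should
// resolve to a file: URL inside your Nutch conf directory.
public class ConfCheck {
    // Returns the classpath location of the resource, or "not found".
    static String locateResource(String name) {
        java.net.URL url = ConfCheck.class.getClassLoader().getResource(name);
        return (url == null) ? "not found" : url.toString();
    }

    public static void main(String[] args) {
        System.out.println(locateResource("nutch-default.xml"));
    }
}
```

If this prints a location outside your project's conf folder, the "Order and Export" step above was not applied.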
  
  


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
The comment on the change is:
Added fix for RTFParseFactory issues

--
  Copy the jar files into src/plugin/parse-mp3/lib and 
src/plugin/parse-rtf/lib/ respectively.
  Then add the jar files to the build path (first refresh the workspace by 
pressing F5, then right-click the project folder > Build Path > Configure Build 
Path..., then select the "Libraries" tab, click "Add Jars...", and add each 
.jar file individually).
  
+ === Two Errors with RTFParseFactory ===
+ 
+ If you are trying to build the official 1.0 release, Eclipse will complain 
about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar 
file from the previous step).  This problem was fixed (see 
[http://issues.apache.org/jira/browse/NUTCH-644 NUTCH-644] and 
[http://issues.apache.org/jira/browse/NUTCH-705 NUTCH-705]) but was not 
included in the 1.0 official release because of licensing issues. So you will 
need to manually alter the code to remove these 2 build errors.
+ 
+ In RTFParseFactory.java:
+  1. Add the following import statement: {{{import 
org.apache.nutch.parse.ParseResult;}}}
+ 
+  2. Change 
+ 
+ {{{
+ public Parse getParse(Content content) {
+ }}}
+ to
+ {{{
+ public ParseResult getParse(Content content) {
+ }}}
+  1.#3 In the getParse function, replace
+ {{{
+ return new ParseStatus(ParseStatus.FAILED,
+ParseStatus.FAILED_EXCEPTION,
+e.toString()).getEmptyParse(conf);
+ }}}
+ with
+ {{{
+ return new ParseStatus(ParseStatus.FAILED,
+ ParseStatus.FAILED_EXCEPTION,
+   e.toString()).getEmptyParseResult(content.getUrl(), getConf());
+ }}}
+  1.#4 In the getParse function, replace
+ {{{
+ return new ParseImpl(text,
+  new ParseData(ParseStatus.STATUS_SUCCESS,
+title,
+OutlinkExtractor.getOutlinks(text, 
this.conf),
+content.getMetadata(),
+metadata));
+ }}}
+ with
+ {{{
+ return ParseResult.createParseResult(content.getUrl(),
+new ParseImpl(text,
+new ParseData(ParseStatus.STATUS_SUCCESS,
+title,
+OutlinkExtractor.getOutlinks(text, 
this.conf),
+content.getMetadata(),
+metadata)));
+ 
+ }}}
+ 
+ In TestRTFParser.java, replace
+ {{{
+ parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
+ }}}
+ with
+ {{{
+ parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", 
content).get(urlString);
+ }}}
+ 
+ Once you have made these changes and saved the files, Eclipse should build 
with no errors.
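The change from Parse to ParseResult reflects that a single fetched page can now yield several parses keyed by URL, which is why the test above has to call .get(urlString). A toy model of that lookup (NOT the real Nutch classes, just an illustration of the pattern):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not the real Nutch API): a "parse result" maps each URL to its
// parse, so callers select the one they want with get(url).
public class ParseResultModel {
    static Map<String, String> createParseResult(String url, String parse) {
        Map<String, String> result = new HashMap<>();
        result.put(url, parse);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> result =
                createParseResult("http://example.com/a.rtf", "parsed text");
        // Callers pick out the parse for the URL they asked about.
        System.out.println(result.get("http://example.com/a.rtf"));
    }
}
```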
  
  === Build Nutch ===
  If you setup the project correctly, Eclipse will build Nutch for you into 
tmp_build. See below for problems you could run into.


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
The comment on the change is:
In case you forget to add cygwin to path

--
* add the hadoop project as a dependent project of the nutch project 
* you can now also set break points within hadoop classes like InputFormat 
implementations, etc. 
  
+ 
+ === Failed to get the current user's information ===
+ 
On Windows, if the crawler throws an exception complaining it "Failed to get 
the current user's information" or 'Login failed: Cannot run program "bash"', 
it is likely you forgot to set the PATH to point to cygwin.  Open a new command 
line window (All Programs > Accessories > Command Prompt) and type "bash".  
This should start cygwin.  If it doesn't, type "path" to see your path. You 
should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See 
the steps for adding this to your PATH at the top of the article under "For 
Windows Users".
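The PATH check above can also be done programmatically; a small sketch (the C:\cygwin\bin location is the default install directory, an assumption):

```java
// Sketch: check whether a Windows-style PATH string contains cygwin's bin
// directory (C:\cygwin\bin is the default install location; adjust if yours
// differs). PATH entries on Windows are separated by ';'.
public class PathCheck {
    static boolean containsCygwinBin(String path) {
        for (String entry : path.split(";")) {
            if (entry.trim().equalsIgnoreCase("C:\\cygwin\\bin")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        System.out.println(path != null && containsCygwinBin(path)
                ? "cygwin bin is on PATH"
                : "cygwin bin NOT found on PATH");
    }
}
```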
+ 
+ 
  Original credits: RenaudRichardet
  


[Nutch Wiki] Trivial Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
The comment on the change is:
Restart Eclipse after setting PATH

--
  
  === Failed to get the current user's information ===
  
- On Windows, if the crawler throws an exception complaining it "Failed to get 
the current user's information" or 'Login failed: Cannot run program "bash"', 
it is likely you forgot to set the PATH to point to cygwin.  Open a new command 
line window (All Programs > Accessories > Command Prompt) and type "bash".  
This should start cygwin.  If it doesn't, type "path" to see your path. You 
should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See 
the steps for adding this to your PATH at the top of the article under "For 
Windows Users".
+ On Windows, if the crawler throws an exception complaining it "Failed to get 
the current user's information" or 'Login failed: Cannot run program "bash"', 
it is likely you forgot to set the PATH to point to cygwin.  Open a new command 
line window (All Programs > Accessories > Command Prompt) and type "bash".  
This should start cygwin.  If it doesn't, type "path" to see your path. You 
should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See 
the steps for adding this to your PATH at the top of the article under "For 
Windows Users". After setting the PATH, you will likely need to restart Eclipse 
so it will use the new PATH.
  
  
  Original credits: RenaudRichardet


[Nutch Wiki] Update of RunNutchInEclipse1.0 by FrankMcCown

2009-04-16 Thread Apache Wiki
The comment on the change is:
Moved heap problem to location of other problems

--
   * if all works, you should see Nutch getting busy at crawling :-)
  
  
- == Java Heap Size problem ==
- 
- If you find in hadoop.log line similar to this:
- 
- {{{
- 2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
- java.lang.OutOfMemoryError: Java heap space
- }}}
- 
- You should increase amount of RAM for running applications from eclipse.
- 
- Just set it in:
- 
- Eclipse - Window - Preferences - Java - Installed JREs - edit - Default 
VM arguments
- 
- I've set mine to 
- {{{
- -Xms5m -Xmx150m 
- }}}
- because I have like 200MB RAM left after runnig all apps
- 
- -Xms (minimum ammount of RAM memory for running applications)
- -Xmx (maximum) 
- 
  == Debug Nutch in Eclipse (not yet tested for 0.9) ==
   * Set breakpoints and debug a crawl
   * It can be tricky to find out where to set the breakpoint, because of the 
Hadoop jobs. Here are a few good places to set breakpoints:
@@ -195, +171 @@

  == If things do not work... ==
  Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)
  
+ === Java Heap Size problem ===
+ 
+ If the crawler throws an IOException early in the crawl (Exception in 
thread "main" java.io.IOException: Job failed!), check the logs/hadoop.log 
file for further information. If you find in hadoop.log lines similar to this:
+ 
+ {{{
+ 2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
+ java.lang.OutOfMemoryError: Java heap space
+ }}}
+ 
+ then you should increase the amount of RAM for applications run from Eclipse.
+ 
+ Just set it in:
+ 
+ Eclipse - Window - Preferences - Java - Installed JREs - edit - Default 
VM arguments
+ 
+ I've set mine to 
+ {{{
+ -Xms5m -Xmx150m 
+ }}}
+ because I have like 200MB RAM left after running all apps
+ 
+ -Xms (minimum amount of RAM for running applications)
+ -Xmx (maximum) 
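To confirm the VM arguments actually took effect, a quick sketch that prints the maximum heap the JVM was launched with (run it from Eclipse with the same launch configuration):

```java
// Sketch: print the maximum heap size the JVM was started with, to verify
// that -Xmx set in Eclipse's "Default VM arguments" is being applied.
public class HeapCheck {
    // Maximum heap in megabytes, as reported by the runtime.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```

With -Xmx150m you should see a value close to 150 MB printed.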
+ 
- === eclipse: Cannot create project content in workspace ===
+ === Eclipse: Cannot create project content in workspace ===
  The Nutch source code must be outside of the workspace folder. My first attempt 
was to download the code with Eclipse (svn) under my workspace. When I tried to 
create the project from existing code, Eclipse would not let me do it from source 
code inside the workspace. I used source code outside of my workspace and it 
worked fine.
  
  === plugin dir not found ===

