[Nutch Wiki] Update of PublicServers by Finbar Dineen

2008-05-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Finbar Dineen:
http://wiki.apache.org/nutch/PublicServers

--
* [http://www.bigsearch.ca/ Bigsearch.ca] uses the Nutch open source software
to deliver its search results.
  
* [http://busytonight.com/ BusyTonight]: Search for any event in the United 
States, by keyword, location, and date. Event listings are automatically 
crawled and updated from original source Web sites.
+ 
+   * [http://www.centralbudapest.com/search Central Budapest Search] is a
search engine for English-language sites focusing on Budapest news,
restaurants, accommodation, life and events.

* [http://circuitscout.com Circuit Scout] is a search engine for electrical 
circuits.
  


Re: Writing a plugin

2008-05-12 Thread Pau
Hi,
I have added my plugin (called "recommended") to nutch-site.xml, but it seems
that Nutch is not using it.
I say this because when I search for "recom" I get no results, even though
there is a page that has the meta tag:
<meta name="recommended" content="recom"/>

I have attached my nutch-site.xml and nutch-default.xml files; maybe you will
see something wrong.
Apart from that, my plugin compiles fine, but when I run "ant test" I get
errors. I have also attached the output of "ant test".
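
For reference, the test page contains markup roughly along these lines (the
surrounding tags and text here are only an illustration, not the actual page):

<html>
  <head>
    <title>Test page for the recommended plugin</title>
    <!-- the meta tag my parse and index filters are supposed to pick up -->
    <meta name="recommended" content="recom"/>
  </head>
  <body>Some text for the crawler to fetch and index.</body>
</html>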

On Sun, May 11, 2008 at 8:08 PM, [EMAIL PROTECTED] wrote:

 Hi,

 Yes, you have to add your plugin to nutch-site.xml, along with the other
 plugins you probably already have defined there.  If you don't have them in
 nutch-site.xml, look at nutch-default.xml.
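
 For example, the override in nutch-site.xml would look roughly like this (the
 plugin id "recommended" is taken from your mail; keep whatever other plugins
 your installation already lists in the value):

 <property>
   <name>plugin.includes</name>
   <!-- add your plugin id to the ids already named in this expression -->
   <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>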

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 ----- Original Message -----
  From: Pau [EMAIL PROTECTED]
  To: nutch-dev@lucene.apache.org
  Sent: Sunday, May 11, 2008 8:28:53 AM
  Subject: Writing a plugin
 
  Hello,
  I am following the WritingPluginExample-0.9 and I am a bit confused about
  how to get Nutch to use my plugin.
  In the section called "Getting Ant to Compile Your Plugin" it says:
  "The next time you run a crawl your parser and index filter should get
  used."
  But at the end of the document, there is another section called "Getting
  Nutch to Use Your Plugin".
  Do I have to edit the nutch-site.xml file as "Getting Nutch to Use Your
  Plugin" says? Or is it not necessary?
  Thank you.


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>PauSpider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch Crawler</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>Description</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin id names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- Do not modify this file directly.  Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there.  If nutch-site.xml does not already exist, create it.      -->

<configuration>

<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NO IMPLEMENTED YET !!
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents