date:20080512

[Nutch Wiki] Update of PublicServers by Finbar Dineen

2008-05-12 Thread Apache Wiki

Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by Finbar Dineen:
http://wiki.apache.org/nutch/PublicServers

--
* [http://www.bigsearch.ca/ Bigsearch.ca] uses nutch open source software 
to deliver its search results.
  
* [http://busytonight.com/ BusyTonight]: Search for any event in the United 
States, by keyword, location, and date. Event listings are automatically 
crawled and updated from original source Web sites.
+ 
+   * [http://www.centralbudapest.com/search Central Budapest Search] is a 
search engine for English language sites focussing on Budapest news, 
restaurants, accommodation, life and events.

* [http://circuitscout.com Circuit Scout] is a search engine for electrical 
circuits.

Re: Writing a plugin

2008-05-12 Thread Pau

Hi,
I have added my plugin (called recommended) to nutch-site.xml but it seems
that Nutch is not using it.
I say this because when I search for recom I get no results, even though
there is a page that has the meta tag:
<meta name="recommended" content="recom"/>

I have attached my nutch-site.xml and nutch-default.xml files, maybe you see
something wrong.
Apart from that, my plugin compiles ok, but when I run ant test I get
errors. I have also attached the output for ant test.
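
One quick sanity check is whether the plugin id is actually selected by the plugin.includes expression. The sketch below is only an illustration, not Nutch's own code: it assumes the value is treated as a Java regular expression that must match the whole plugin id, and it assumes the plugin id is "recommended". The class and method names (PluginIncludesCheck, isIncluded) are made up for this example.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // Value copied from the plugin.includes property in the attached nutch-site.xml.
    static final String PLUGIN_INCLUDES =
        "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"
        + "|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is included only if its entire id matches the expression.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // "recommended" is the assumed plugin id; the other ids are for comparison.
        String[] ids = {"recommended", "parse-html", "parse-pdf", "recommended-plugin"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(PLUGIN_INCLUDES, id));
        }
    }
}
```

If the id declared in the plugin's plugin.xml differs from the name used in plugin.includes (for example "recommended-plugin"), a full-match check like this one would reject it, so the two names are worth comparing.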

On Sun, May 11, 2008 at 8:08 PM, [EMAIL PROTECTED] wrote:

Hi,

Yes, you have to add your plugin to nutch-site.xml, along with other
plugins you probably already have defined there. If you don't have them in
nutch-site.xml, look at nutch-default.xml

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
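
As a rough illustration of the override behaviour described above (entries in nutch-site.xml replace the entries of the same name in nutch-default.xml), here is a small stand-alone sketch. It does not use Nutch or Hadoop classes; it only parses two configuration strings with the JDK's DOM parser and layers one over the other. The class and method names (ConfigOverlaySketch, load, effective) are hypothetical, and the two XML strings are shortened stand-ins for the real files.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ConfigOverlaySketch {
    // Copies every <property> name/value pair into target,
    // replacing any value already stored under the same name.
    static void load(String xml, Map<String, String> target) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed configuration XML", e);
        }
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            target.put(name, value);
        }
    }

    // Defaults are loaded first, then the site file, so site entries win.
    static Map<String, String> effective(String defaultsXml, String siteXml) {
        Map<String, String> conf = new LinkedHashMap<>();
        load(defaultsXml, conf);
        load(siteXml, conf);
        return conf;
    }

    public static void main(String[] args) {
        String defaults = "<configuration>"
            + "<property><name>http.agent.name</name><value></value></property>"
            + "<property><name>file.content.limit</name><value>65536</value></property>"
            + "</configuration>";
        String site = "<configuration>"
            + "<property><name>http.agent.name</name><value>PauSpider</value></property>"
            + "</configuration>";
        System.out.println(effective(defaults, site));
    }
}
```

The point of the sketch is only the ordering: a property missing from nutch-site.xml keeps its default, so plugin.includes has to be copied over in full, with the new plugin added, rather than listing the new plugin alone.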

----- Original Message -----
From: Pau [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Sunday, May 11, 2008 8:28:53 AM
Subject: Writing a plugin

Hello,
I am following the WritingPluginExample-0.9 and I am a bit confused about
how to get Nutch to use my plugin.
In the section called "Getting Ant to Compile Your Plugin" it says:
"The next time you run a crawl your parser and index filter should get
used."
But at the end of the document, there is another section called "Getting
Nutch to Use Your Plugin".
Do I have to edit the nutch-site.xml file as "Getting Nutch to Use Your
Plugin" says? Or is it not necessary?
Thank you.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>PauSpider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch Crawler</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>Description</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin id names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
</configuration>

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- Do not modify this file directly. Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there. If nutch-site.xml does not already exist, create it. -->

<configuration>

<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NO IMPLEMENTED YET !!
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
