Try querying with both the date and something you'd expect to find in the
content. The field query filter is just a filter. It only restricts your
results to things that match the basic query and have the contents you require
in the field. So if you query for date:2006080 text you'll be
Try making an empty directory called
C:\nutch\nutch-0.7.2\src\plugin\nutch-extensionpoints\src\java and see
if that helps.
-Original Message-
From: Dagum, Leo [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 15, 2006 5:03 PM
To: nutch-user@lucene.apache.org
Subject: RE: nutch .72
?
Or do I need to create a separate plugin for that - basically exactly
what I had before with fields=fieldname in the plugin.xml?
Thanks a ton.
Scott
On 2/8/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote:
Scott,
You probably just need to edit your plugin.xml file to set
fields
What open source tools do people like for analyzing nutch search
log files? I'm specifically looking to find out most frequent search
terms. The reports are for internal consumption to help understand what
people are looking for and try to make sure they're finding it.
Thanks,
Jake.
I've added some code to query-basic to log the query after it
has run both addTerms and addPhrases. This helps me to better
understand what's going on. I've noticed that when my search contains
words like "the" or "a", those don't appear in the actual query.
It looks to me like the
Vanderdray, Jacob wrote:
I've added some code to query-basic to log the query after it
has run both addTerms and addPhrases. This helps me to better
understand what's going on. I've noticed that when my search contains
words like "the" or "a", those don't appear in the actual query.
It looks
Richard,
I'm not working with .8 right now, so I can't comment too much
on your findings, but if you want to propose changes to the 0.8 tutorial
on the web site check out
http://wiki.apache.org/nutch/Website_Update_HOWTO. Then you can post
your patch to JIRA and if folks agree with your
I'm getting the exception below when trying to crawl an https
site. I'm using IBM's JRE version 1.4.2 on RedHat Enterprise Linux.
This is with a copy of the 0.7 branch pulled from svn this morning.
I'd thought I had https crawling working (using httpclient), but
I haven't been
Dennis,
How 'bout the wiki.
Jake.
-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Monday, March 20, 2006 1:01 PM
To: nutch-user@lucene.apache.org
Subject: Nutch and Hadoop Tutorial Finished
All,
I have finished a lengthy tutorial on how to set up a
: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
Sent: Monday, March 20, 2006 12:20 PM
To: nutch-user@lucene.apache.org
Subject: RE: Nutch and Hadoop Tutorial Finished
Dennis,
How 'bout the wiki.
Jake.
-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Monday
Giang,
Is it possible you switched up the numbers? Shouldn't it be:
N(hello) = 77
N(world) = 894
N(hello world) = 45
N(hello OR world) = 926
If so then I agree that it seems to work. I'd be very interested in
seeing this added back into nutch. The instructions for creating a
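
The corrected numbers above can be checked with inclusion-exclusion. This is only a sketch: it assumes N(hello world) counts documents containing both terms, and the variable names are made up for illustration.

```python
# Hit counts as given in the message above.
n_hello = 77    # N(hello)
n_world = 894   # N(world)
n_both = 45     # N(hello world), assumed to mean documents containing both terms

# Inclusion-exclusion: documents matching either term are those matching
# "hello" plus those matching "world", minus the ones counted twice.
n_either = n_hello + n_world - n_both

print(n_either)  # 926, which agrees with N(hello OR world)
```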
-1
I found the instructions for doing an Intranet crawl extremely
helpful for getting up and running quickly. I went back later and
figured out more about what it was actually doing. Perhaps the name
could just be changed to Single Site Crawling with the Nutch Shell
Script and some
the smallest of websites.
The distinction is not only on the scale of the project, but on the
level of control one wants (IMHO). The documentation should at least
give hints in that direction.
Thanks, Frank.
On 3/7/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote:
-1
I found the instructions
.
Common configuration issues:
Maximum Content size:
Maximum Retries.
What else?
I will gladly post these changes to the Wiki, given some votes of
confidence and other suggestions.
-Original Message-
From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 07, 2006 1:52 PM
What would I need to do to extend the BasicQueryFilter? Just
using import org.apache.nutch.searcher.basic.BasicQueryFilter; doesn't
seem to be enough.
Thanks,
Jake.
What version of nutch are you working with? I wrote the example
based on the 0.7.1 base.
The second error seems to indicate that you don't have a filter
method in your indexer plugin. Check to make sure there isn't a typo in
the name of the method.
Good luck,
Jake.
If you look at the section of the tutorial for doing intranet
crawls, you should be able to use that for your small number of
websites. The bin/nutch script wraps up all the crawl functions for you
(fetching, indexing, deduping, etc). You'll just need to delete the
results of your
the container (JBoss, in my
case).
Is this a feature inherited from lucene? Or maybe this will be improved
in the future?
Thanks again.
En/na Vanderdray, Jacob ha escrit:
If you look at the section of the tutorial for doing intranet
crawls, you should be able to use that for your small
One difference you'll want is to change the plugin.xml file so
that your query filter gets used just for the fields you're interested
in. Instead of fields="DEFAULT" in the example, you'll want
raw-fields="language" and raw-fields="category". Assuming you name the
fields language and category
I'm not sure I understand what you're getting at. In this case
I've added a comma separated list of names of meta tags that I want to
index and search against. I've written a parse filter, an index filter
and this query filter that all read in that list of meta tags from the
I've written some extensions that allow you to define meta tags
that you would like included in nutch indexing and searching. The meta
tag names are defined in nutch-site.xml. In general this seems to be
working, but I'm seeing some problems with searching.
I've added the
similar options in crawl-urlfilter.txt. But in my case
these directories do need to be crawled. However, the directory name
should not be indexed as a single document. It's more like we'd have a
file called index-urlfilter.txt.
Thanks,
--Jay
Vanderdray, Jacob wrote:
Jay,
The url field
Jay,
The url field is handled by the query-basic filter. There is a
setting inside conf/nutch-default.xml that controls the weighting
(boost) for that field. You can reduce the influence of this field by
putting a new value in your conf/nutch-site.xml file. You may even be
able to
Subject: Re: Link to Search Interface for List
Vanderdray, Jacob wrote:
I get the same thing from my linux box. The only reference I
can find to linkmap.html is a commented out line in forrest.properties.
FWIW: I've already made the changes to my copy of mailing_lists.xml.
Let me know
From the source of the query-more plugin:
// query syntax is defined as date:mmdd-mmdd
Jake.
-Original Message-
From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 14, 2006 12:54 PM
To: nutch-user@lucene.apache.org
Subject: format for range date query
Hi
No problem. Check out the Intranet configuration section of
the tutorial (http://lucene.apache.org/nutch/tutorial.html):
Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with
the name of the domain you wish to crawl. For example, if you wished to
limit the crawl
Would it be possible to add a link to
http://www.mail-archive.com/nutch-user%40lucene.apache.org/ on
http://lucene.apache.org/nutch/mailing_lists.html? I'd suggest it goes
as another bullet point under the Users mailing list section with the
link text "Search the List Archives".
If you control the temporary links pages, then just add a
robots meta tag. Take a look at
http://www.robotstxt.org/wc/meta-user.html to see what your options are.
Jake.
-Original Message-
From: Elwin [mailto:[EMAIL PROTECTED]
Sent: Friday, February 10, 2006 4:38 AM
To:
Is anyone else having trouble getting to the Nutch wiki? While
we're on the topic, does anyone know if the wiki gets backed up?
Thanks,
Jake.
it up?
Jeff
Vanderdray, Jacob wrote:
Is anyone else having trouble getting to the Nutch wiki? While
we're on the topic, does anyone know if the wiki gets backed up?
Thanks,
Jake.
Scott,
You probably just need to edit your plugin.xml file to set
fields=DEFAULT. Take a look at
http://wiki.apache.org/nutch/WritingPluginExample and see if that helps.
Jake.
-Original Message-
From: Scott Owens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 08, 2006
what I had before with fields=fieldname in the plugin.xml?
Thanks a ton.
Scott
On 2/8/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote:
Scott,
You probably just need to edit your plugin.xml file to set
fields=DEFAULT. Take a look at
http://wiki.apache.org/nutch/WritingPluginExample
If you only want to crawl www.woodward.edu, then change
+^http://([a-z0-9]*\.)*woodward.edu/
To:
+^http://www.woodward.edu/
Jake.
-Original Message-
From: Andy Morris [mailto:[EMAIL PROTECTED]
Sent: Monday, February 06, 2006 9:00 PM
To: nutch-user@lucene.apache.org
Subject:
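
A quick way to see what the crawl-urlfilter.txt change above does is to run both patterns against a few sample URLs. The sketch below uses Python's re module purely for illustration (Nutch applies these as Java regular expressions, and the leading + only marks an include rule, so it is dropped here). The library.woodward.edu host is a made-up example.

```python
import re

# The two patterns from the message, without the leading '+'.
ANY_SUBDOMAIN = re.compile(r"^http://([a-z0-9]*\.)*woodward.edu/")
WWW_ONLY = re.compile(r"^http://www.woodward.edu/")

urls = [
    "http://www.woodward.edu/index.html",
    "http://library.woodward.edu/catalog",  # hypothetical subdomain
    "http://woodward.edu/",
]

def accepted(pattern, url):
    # The ^ anchor pins the match to the start of the URL.
    return pattern.search(url) is not None

matches_any = [u for u in urls if accepted(ANY_SUBDOMAIN, u)]
matches_www = [u for u in urls if accepted(WWW_ONLY, u)]

print(matches_any)  # all three URLs
print(matches_www)  # only the www.woodward.edu URL
```

Note that the dots in woodward.edu are unescaped in both patterns, so they match any character. Writing them as \. would make the rule stricter.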
Enrico,
What does your plugin.xml file look like? Also, what happens if
you do a search for:
quiewId:a3d32ce0cae0da47677f30cc6182d421
[A_word_found_in_the_body_of_the_document]
Jake.
-Original Message-
From: Enrico Triolo [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 31,
I think you'd need to write a plugin to extract, index and query
the data you're interested in. I don't think that any of the existing
plugins grab this data.
Jake.
-Original Message-
From: Meryl Silverburgh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 31, 2006 12:22 AM
To:
In the plugin.xml file below, try changing:
<implementation id="URLQueryFilter"
class="org.apache.nutch.searcher.quiew.QuiewQueryFilter"
fields="url"/>
To:
<implementation id="URLQueryFilter"
class=
One other note: with the fields set to DEFAULT, you'll want to
just query for a3d32ce0cae0da47677f30cc6182d421 instead of
quiewId:a3d32ce0cae0da47677f30cc6182d421.
Jake.
-Original Message-
From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 31, 2006 11:07 AM
If you want to force that to be part of every query, you could
edit search.jsp to add something like:
queryString = queryString + " type:pdf lang:en";
If you want it to be the default, but let the user change it,
I'd add fields to the search form and make the default be
What's jira? I'm actually in the process of writing a set of
plugins to process meta tags that you define in the nutch-site.xml file,
so I'd be interested in reading about what's being worked on.
Thanks,
Jake.
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
I went through and broke up the wiki page on writing plugins
into a general page explaining what plugins are
(http://wiki.apache.org/nutch/AboutPlugins) and a step-by-step example
of writing a plugin (http://wiki.apache.org/nutch/WritingPluginExample).
Feel free to edit and correct.
ant.apache.org instead of using an rpm.
Thanks,
Jake.
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Saturday, December 03, 2005 1:32 PM
To: nutch-user@lucene.apache.org
Subject: Re: Class Not Found
Vanderdray, Jacob wrote:
I installed ant from an rpm
Sure. So far I've gotten around this by using the pre-built
ROOT.war from the 0.7 distribution and replacing the files I've added
and changed. I'm planning on installing ant from source, but haven't
gotten around to it yet.
Thanks,
Jake.
$ ant war
Buildfile: build.xml
init:
When I try building the war file from source, I'm getting an
error about some ant stuff that I must not have installed. Does anyone
know what package I need to get this fixed?
java.lang.ClassNotFoundException:
org.apache.tools.ant.taskdefs.optional.XslpLiaison
Thanks,
Jake.
, check that you are not
using ant 1.5.
Stefan
Am 30.11.2005 um 20:30 schrieb Vanderdray, Jacob:
When I try building the war file from source, I'm getting an
error about some ant stuff that I must not have installed. Does
anyone
know what package I need to get this fixed
Try
src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java
Jake.
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, November 28, 2005 12:51 PM
To: nutch-user@lucene.apache.org
Subject:
Just run 'ant' in the top directory of your downloaded source. There
should be a build.xml file in there that tells ant what to do.
Jake.
-Original Message-
From: Victor Lee [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 22, 2005 4:13 PM
To: nutch-user@lucene.apache.org
Subject: How
I've written a plugin that adds a new field to the index. When
I look at the explanation page for a given search result I can see my
field listed. How do I get the searcher to do searches against the new
field? I'm reading through the code in
src/java/org/apache/nutch/searcher and I
I can do that. Thanks!
Jake.
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 22, 2005 5:37 PM
To: nutch-user@lucene.apache.org
Subject: Re: Adding Field to be Searched
Vanderdray, Jacob wrote:
I've written a plugin that adds a new
Dean,
I'm not sure if the nutch crawler actually supports it, but you
should be able to use a robots noindex Meta tag in the archive pages.
See http://www.robotstxt.org/wc/meta-user.html for more information.
Jake.
-Original Message-
From: Dean Elwood [mailto:[EMAIL PROTECTED]
Thanks. I just swapped it in for the front page. I also found that
pdf and fixed the link (http://wiki.apache.org/nutch/Evaluations).
Jake.
-Original Message-
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 13, 2005 4:09 PM
To:
I've started work on documenting how to write a plugin. I'm
writing it as I figure it out myself so any corrections or pointers are
appreciated.
http://wiki.apache.org/nutch/WritingPlugins
Thanks,
Jake.
Bill,
Take a look at search.jsp. It looks like it sets the base href
based on the name you hit it as. Since you're proxying back to
localhost:8080, that's what it sets as the base href. I think you
should be able to just hardcode that to be rfe.org, restart tomcat and
it should work.
/de/
Stefan
Am 10.11.2005 um 20:34 schrieb Vanderdray, Jacob:
Stefan,
Do you mind if I take the pages you do have (Which technical
concepts are behind the nutch plugin system, etc.) and put them into
the nutch wiki as separate pages? I'd like to link to them from
http
10.11.2005 um 21:48 schrieb Vanderdray, Jacob:
Stefan,
I'm in complete agreement that it doesn't make sense to edit the
same stuff in two places. In reading through what you've written, it
looks really good, so at a minimum I'll add links to it from the nutch
wiki. Would you consider
forward. For what it's worth I get the same error if I
use the source in the trunk instead of the 0.7 branch.
Jake.
-Original Message-
From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 08, 2005 4:55 PM
To: nutch-user@lucene.apache.org
Subject: Compilation Errors
I just used svn to grab the nutch-0.7 branch and I get the error
below when I try to use ant. This is on a Mac (OS 10.4.3). I'm
guessing I just need to uncomment some defaults, but any pointers would
be appreciated.
Thanks,
Jake.
compile-core:
[javac] Compiling 247 source files to
56 matches