RE: Querying Fields

2006-08-02 Thread Vanderdray, Jacob
Try querying with both the date and something you'd expect to find in the content. The field query filter is just a filter. It only restricts your results to things that match the basic query and have the contents you require in the field. So if you query for date:2006080 text you'll be

RE: nutch .72 out-of-the-box build issue

2006-06-19 Thread Vanderdray, Jacob
Try making an empty directory called C:\nutch\nutch-0.7.2\src\plugin\nutch-extensionpoints\src\java and see if that helps. -Original Message- From: Dagum, Leo [mailto:[EMAIL PROTECTED] Sent: Thursday, June 15, 2006 5:03 PM To: nutch-user@lucene.apache.org Subject: RE: nutch .72

RE: [Nutch-general] RE: boosting custom field values in scoring algorithm

2006-04-07 Thread Vanderdray, Jacob
? Or do I need to create a separate plugin for that - basically exactly what I had before with fields=fieldname in the plugin.xml? Thanks a ton. Scott On 2/8/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote: Scott, You probably just need to edit your plugin.xml file to set fields

Log Analysis

2006-03-31 Thread Vanderdray, Jacob
What open source tools do people like for analyzing nutch search log files? I'm specifically looking to find out most frequent search terms. The reports are for internal consumption to help understand what people are looking for and try to make sure they're finding it. Thanks, Jake.

Common Terms

2006-03-30 Thread Vanderdray, Jacob
I've added some code to query-basic to log the query after it has run both addTerms and addPhrases. This helps me to better understand what's going on. I've noticed that when my search contains words like the or a, those don't appear in the actual query. It looks to me like the

RE: Common Terms

2006-03-30 Thread Vanderdray, Jacob
Vanderdray, Jacob wrote: I've added some code to query-basic to log the query after it has run both addTerms and addPhrases. This helps me to better understand what's going on. I've noticed that when my search contains words like the or a, those don't appear in the actual query. It looks

RE: .8 Searching

2006-03-27 Thread Vanderdray, Jacob
Richard, I'm not working with .8 right now, so I can't comment too much on your findings, but if you want to propose changes to the 0.8 tutorial on the web site check out http://wiki.apache.org/nutch/Website_Update_HOWTO. Then you can post your patch to JIRA and if folks agree with your

https crawl exception with 0.7 branch

2006-03-23 Thread Vanderdray, Jacob
I'm getting the exception below when trying to crawl an https site. I'm using IBM's JRE version 1.4.2 on RedHat Enterprise Linux. This is with a copy of the 0.7 branch pulled from svn this morning. I'd thought I had https crawling working (using httpclient), but I haven't been

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Vanderdray, Jacob
Dennis, How 'bout the wiki. Jake. -Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 1:01 PM To: nutch-user@lucene.apache.org Subject: Nutch and Hadoop Tutorial Finished All, I have finished a lengthy tutorial on how to setup a

RE: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Vanderdray, Jacob
: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 12:20 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch and Hadoop Tutorial Finished Dennis, How 'bout the wiki. Jake. -Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Monday

RE: Boolean OR QueryFilter

2006-03-16 Thread Vanderdray, Jacob
Giang, Is it possible you switched up the numbers? Shouldn't it be: N(hello) = 77 N(world) = 894 N(hello world) = 45 N(hello OR world) = 926 If so then I agree that it seems to work. I'd be very interested in seeing this added back into nutch. The instructions for creating a
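The corrected figures are consistent with inclusion-exclusion: the number of documents matching either term should equal N(hello) + N(world) minus the number matching both. A minimal Java sketch of that check, using the counts quoted in the message. The class and method names are illustrative and not part of Nutch, and it assumes N(hello world) counts documents containing both terms:

```java
// Illustrative only: OrCountCheck is not part of Nutch.
final class OrCountCheck {
    // Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
    static int expectedOrCount(int nA, int nB, int nBoth) {
        return nA + nB - nBoth;
    }

    public static void main(String[] args) {
        // Counts quoted in the message: N(hello), N(world), N(hello world).
        int expected = expectedOrCount(77, 894, 45);
        System.out.println("N(hello OR world) = " + expected); // 77 + 894 - 45 = 926
    }
}
```

If an OR query returns a count different from this sum, either the two single-term counts or the combined count was probably transcribed wrongly.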

RE: project vitality? / less documentation is more!

2006-03-07 Thread Vanderdray, Jacob
-1 I found the instructions for doing an Intranet crawl extremely helpful for getting up and running quickly. I went back later and figured out more about what it was actually doing. Perhaps the name could just be changed to Single Site Crawling with the Nutch Shell Script and some

RE: project vitality? / less documentation is more!

2006-03-07 Thread Vanderdray, Jacob
the smallest of websites. The distinction is not only on the scale of the project, but on the level of control one wants (IMHO). The documentation should at least give hints in that direction. Thanks, Frank. On 3/7/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote: -1 I found the instructions

RE: Tutorial on the Wiki

2006-03-07 Thread Vanderdray, Jacob
. Common configuration issues: Maximum Content size: Maximum Retries. What else? I will gladly post these changes to the Wiki, given some votes of confidence and other suggestions. -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 1:52 PM

Extending BasicQueryFilter

2006-02-28 Thread Vanderdray, Jacob
What would I need to do to extend the BasicQueryFilter? Just using import org.apache.nutch.searcher.basic.BasicQueryFilter; doesn't seem to be enough. Thanks, Jake.

RE: recommended plugin example

2006-02-24 Thread Vanderdray, Jacob
What version of nutch are you working with? I wrote the example based on the 0.7.1 base. The second error seems to indicate that you don't have a filter method in your indexer plugin. Check to make sure there isn't a typo in the name of the method. Good luck, Jake.

RE: Simple indexation and reindexation

2006-02-23 Thread Vanderdray, Jacob
If you look at the section of the tutorial for doing intranet crawls, you should be able to use that for your small number of websites. The bin/nutch script wraps up all the crawl functions for you (fetching, indexing, deduping, etc). You'll just need to delete the results of your

RE: Simple indexation and reindexation

2006-02-23 Thread Vanderdray, Jacob
the container (JBoss, in my case). Is this a feature inherited from lucene? Or maybe this will be improved in the future? Thanks again. En/na Vanderdray, Jacob ha escrit: If you look at the section of the tutorial for doing intranet crawls, you should be able to use that for your small

RE: meta in search query string

2006-02-23 Thread Vanderdray, Jacob
One difference you'll want is to change the plugin.xml file so that your query filter gets used just for the fields you're interested in. Instead of fields=DEFAULT in the example, you'll want raw-fields=language and raw-fields=category. Assuming you name the fields language and category

RE: Search Particulars

2006-02-23 Thread Vanderdray, Jacob
I'm not sure I understand what you're getting at. In this case I've added a comma separated list of names of meta tags that I want to index and search against. I've written a parse filter, an index filter and this query filter that all read in that list of meta tags from the

Search Particulars

2006-02-20 Thread Vanderdray, Jacob
I've written some extensions that allow you to define meta tags that you would like included in nutch indexing and searching. The meta tag names are defined in nutch-site.xml. In general this seems to be working, but I'm seeing some problems with searching. I've added the

RE: not indexing path names

2006-02-17 Thread Vanderdray, Jacob
similar options in crawl-urlfilter.txt. But in my case these directories do need to be crawled. However, the directory name should not be indexed as a single document. It's more like we'd have a file called index-urlfilter.txt. Thanks, --Jay Vanderdray, Jacob wrote: Jay, The url field

RE: not indexing path names

2006-02-16 Thread Vanderdray, Jacob
Jay, The url field is handled by the query-basic filter. There is a setting inside conf/nutch-default.xml that controls the weighting (boost) for that field. You can reduce the influence of this field by putting a new value in your conf/nutch-site.xml file. You may even be able to

RE: Link to Search Interface for List

2006-02-16 Thread Vanderdray, Jacob
Subject: Re: Link to Search Interface for List Vanderdray, Jacob wrote: I get the same thing from my linux box. The only reference I can find to linkmap.html is a commented out line in forrest.properties. FWIW: I've already made the changes to my copy of mailing_lists.xml. Let me know

RE: format for range date query

2006-02-14 Thread Vanderdray, Jacob
From the source of the query-more plugin: // query syntax is defined as date:mmdd-mmdd Jake. -Original Message- From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 14, 2006 12:54 PM To: nutch-user@lucene.apache.org Subject: format for range date query Hi

RE: Nutch search engine can be used to search only on specific domain?

2006-02-14 Thread Vanderdray, Jacob
No problem. Check out the Intranet configuration section of the tutorial (http://lucene.apache.org/nutch/tutorial.html): Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl

Link to Search Interface for List

2006-02-14 Thread Vanderdray, Jacob
Would it be possible to add a link to http://www.mail-archive.com/nutch-user%40lucene.apache.org/ on http://lucene.apache.org/nutch/mailing_lists.html? I'd suggest it go as another bullet point under the Users mailing list section with the link text Search the List Archives.

RE: How to control contents to be indexed?

2006-02-10 Thread Vanderdray, Jacob
If you control the temporary links pages, then just add a robots meta tag. Take a look at http://www.robotstxt.org/wc/meta-user.html to see what your options are. Jake. -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Friday, February 10, 2006 4:38 AM To:

Wiki

2006-02-09 Thread Vanderdray, Jacob
Is anyone else having trouble getting to the Nutch wiki? While we're on the topic, does anyone know if the wiki gets backed up? Thanks, Jake.

RE: Wiki

2006-02-09 Thread Vanderdray, Jacob
it up? Jeff Vanderdray, Jacob wrote: Is anyone else having trouble getting to the Nutch wiki? While we're on the topic, does anyone know if the wiki gets backed up? Thanks, Jake.

RE: boosting custom field values in scoring algorithm

2006-02-08 Thread Vanderdray, Jacob
Scott, You probably just need to edit your plugin.xml file to set fields=DEFAULT. Take a look at http://wiki.apache.org/nutch/WritingPluginExample and see if that helps. Jake. -Original Message- From: Scott Owens [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 08, 2006

RE: [Nutch-general] RE: boosting custom field values in scoring algorithm

2006-02-08 Thread Vanderdray, Jacob
what I had before with fields=fieldname in the plugin.xml? Thanks a ton. Scott On 2/8/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote: Scott, You probably just need to edit your plugin.xml file to set fields=DEFAULT. Take a look at http://wiki.apache.org/nutch/WritingPluginExample

RE: How deep to go

2006-02-07 Thread Vanderdray, Jacob
If you only want to crawl www.woodward.edu, then change +^http://([a-z0-9]*\.)*woodward.edu/ To: +^http://www.woodward.edu/ Jake. -Original Message- From: Andy Morris [mailto:[EMAIL PROTECTED] Sent: Monday, February 06, 2006 9:00 PM To: nutch-user@lucene.apache.org Subject:
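The difference between the two filter lines can be checked outside Nutch. A minimal Java sketch, assuming the text after the leading + is a regular expression that is searched for within each URL; the class is illustrative rather than Nutch code, and the alumni host name is made up for the comparison:

```java
import java.util.regex.Pattern;

// Illustrative only: UrlFilterSketch is not Nutch code. It assumes the text
// after the leading '+' is a regular expression searched for within the URL.
final class UrlFilterSketch {
    // Original rule: any (possibly empty) chain of subdomains of woodward.edu.
    static final Pattern ANY_SUBDOMAIN =
        Pattern.compile("^http://([a-z0-9]*\\.)*woodward.edu/");
    // Suggested rule: only the www host.
    static final Pattern WWW_ONLY =
        Pattern.compile("^http://www.woodward.edu/");

    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        // "alumni" is a hypothetical host name used only for the comparison.
        String[] urls = {
            "http://www.woodward.edu/index.html",
            "http://alumni.woodward.edu/index.html"
        };
        for (String url : urls) {
            System.out.println(url
                + " any-subdomain=" + accepts(ANY_SUBDOMAIN, url)
                + " www-only=" + accepts(WWW_ONLY, url));
        }
    }
}
```

Both rules accept the www URL, but only the original rule accepts the other host. As in the quoted patterns, the unescaped dots match any character; writing them as \. would make the rules stricter.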

RE: Problem with plugins

2006-01-31 Thread Vanderdray, Jacob
Enrico, What does your plugin.xml file look like? Also, what happens if you do a search for: quiewId:a3d32ce0cae0da47677f30cc6182d421 [A_word_found_in_the_body_of_the_document] Jake. -Original Message- From: Enrico Triolo [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 31,

RE: Use Nutch to collect web statistic

2006-01-31 Thread Vanderdray, Jacob
I think you'd need to write a plugin to extract, index and query the data you're interested in. I don't think that any of the existing plugins grab this data. Jake. -Original Message- From: Meryl Silverburgh [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 31, 2006 12:22 AM To:

RE: Problem with plugins

2006-01-31 Thread Vanderdray, Jacob
In the plugin.xml file below, try changing: <implementation id="URLQueryFilter" class="org.apache.nutch.searcher.quiew.QuiewQueryFilter" fields="url"/> To: <implementation id="URLQueryFilter" class=

RE: Problem with plugins

2006-01-31 Thread Vanderdray, Jacob
One other note: with the fields set to DEFAULT, you'll want to just query for a3d32ce0cae0da47677f30cc6182d421 instead of quiewId:a3d32ce0cae0da47677f30cc6182d421. Jake. -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 31, 2006 11:07 AM

RE: passing type: or lang: as hidden field (not in query)

2006-01-31 Thread Vanderdray, Jacob
If you want to force that to be part of every query, you could edit search.jsp to add something like: queryString = queryString + " type:pdf lang:en"; If you want it to be the default, but let the user change it, I'd add fields to the search form and make the default be
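The forced-terms suggestion amounts to plain string concatenation before the query is parsed. A minimal Java sketch of that step; the class and method are hypothetical and not part of search.jsp, and the leading space before type:pdf is an assumption made so the clauses do not run into the user's last term:

```java
// Illustrative only: ForcedQueryTerms is not part of Nutch or search.jsp.
final class ForcedQueryTerms {
    // Appends the fixed field clauses to whatever the user typed.
    // The leading space keeps them separate from the user's last term.
    static String withForcedTerms(String queryString) {
        return queryString + " type:pdf lang:en";
    }

    public static void main(String[] args) {
        System.out.println(withForcedTerms("nutch tutorial"));
    }
}
```

Because the clauses are appended unconditionally, a user who types their own type: or lang: clause would end up with both in the query, which is why the message suggests form fields with defaults when the user should be able to change them.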

RE: adding meta to domain

2006-01-31 Thread Vanderdray, Jacob
What's jira? I'm actually in the process of writing a set of plugins to process meta tags that you define in the nutch-site.xml file, so I'd be interested in reading about what's being worked on. Thanks, Jake. -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED]

More Documentation

2006-01-29 Thread Vanderdray, Jacob
I went through and broke up the wiki page on writing plugins into a general page explaining what plugins are (http://wiki.apache.org/nutch/AboutPlugins) and a step-by-step example of writing a plugin (http://wiki.apache.org/nutch/WritingPluginExample). Feel free to edit and correct.

RE: Class Not Found

2005-12-07 Thread Vanderdray, Jacob
ant.apache.org instead of using an rpm. Thanks, Jake. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Saturday, December 03, 2005 1:32 PM To: nutch-user@lucene.apache.org Subject: Re: Class Not Found Vanderdray, Jacob wrote: I installed ant from an rpm

RE: Class Not Found

2005-12-02 Thread Vanderdray, Jacob
Sure. So far I've gotten around this by using the pre-built ROOT.war from the 0.7 distribution and replacing the files I've added and changed. I'm planning on installing ant from source, but haven't gotten around to it yet. Thanks, Jake. $ ant war Buildfile: build.xml init:

Class Not Found

2005-11-30 Thread Vanderdray, Jacob
When I try building the war file from source, I'm getting an error about some ant stuff that I must not have installed. Does anyone know what package I need to get this fixed? java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.XslpLiaison Thanks, Jake.

RE: Class Not Found

2005-11-30 Thread Vanderdray, Jacob
, check that you are not using ant 1.5. Stefan Am 30.11.2005 um 20:30 schrieb Vanderdray, Jacob: When I try building the war file from source, I'm getting an error about some ant stuff that I must not have installed. Does anyone know what package I need to get this fixed

RE: org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f)

2005-11-28 Thread Vanderdray, Jacob
Try src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java Jake. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Monday, November 28, 2005 12:51 PM To: nutch-user@lucene.apache.org Subject:

RE: How to Recompile and Build Modified Nutch?

2005-11-22 Thread Vanderdray, Jacob
Just run 'ant' in the top directory of your downloaded source. There should be a build.xml file in there that tells ant what to do. Jake. -Original Message- From: Victor Lee [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 22, 2005 4:13 PM To: nutch-user@lucene.apache.org Subject: How

Adding Field to be Searched

2005-11-22 Thread Vanderdray, Jacob
I've written a plugin that adds a new field to the index. When I look at the explanation page for a given search result I can see my field listed. How do I get the searcher to do searches against the new field? I'm reading through the code in src/java/org/apache/nutch/searcher and I

RE: Adding Field to be Searched

2005-11-22 Thread Vanderdray, Jacob
I can do that. Thanks! Jake. -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 22, 2005 5:37 PM To: nutch-user@lucene.apache.org Subject: Re: Adding Field to be Searched Vanderdray, Jacob wrote: I've written a plugin that adds a new

RE: Crawling a page for links, but not indexing it

2005-11-17 Thread Vanderdray, Jacob
Dean, I'm not sure if the nutch crawler actually supports it, but you should be able to use a robots noindex Meta tag in the archive pages. See http://www.robotstxt.org/wc/meta-user.html for more information. Jake. -Original Message- From: Dean Elwood [mailto:[EMAIL PROTECTED]

RE: Proposed Changes to Wiki HomePage

2005-11-14 Thread Vanderdray, Jacob
Thanks. I just swapped it in for the front page. I also found that pdf and fixed the link (http://wiki.apache.org/nutch/Evaluations). Jake. -Original Message- From: Jérôme Charron [mailto:[EMAIL PROTECTED] Sent: Sunday, November 13, 2005 4:09 PM To:

Documenting How to Write a Plugin

2005-11-14 Thread Vanderdray, Jacob
I've started work on documenting how to write a plugin. I'm writing it as I figure it out myself so any corrections or pointers are appreciated. http://wiki.apache.org/nutch/WritingPlugins Thanks, Jake.

RE: Proxy base href Problem

2005-11-10 Thread Vanderdray, Jacob
Bill, Take a look at search.jsp. It looks like it sets the base href based on the name you hit it as. Since you're proxying back to localhost:8080, that's what it sets as the base href. I think you should be able to just hardcode that to be rfe.org, restart tomcat and it should work.

RE: Plugin Documentation

2005-11-10 Thread Vanderdray, Jacob
/de/ Stefan Am 10.11.2005 um 20:34 schrieb Vanderdray, Jacob: Stefan, Do you mind if I take the pages you do have (Which technical concepts are behind the nutch plugin system, etc.) and put them into the nutch wiki as separate pages? I'd like to link to them from http

RE: Plugin Documentation

2005-11-10 Thread Vanderdray, Jacob
10.11.2005 um 21:48 schrieb Vanderdray, Jacob: Stefan, I'm in complete agreement that it doesn't make sense to edit the same stuff in two places. In reading through what you've written, it looks really good, so at a minimum I'll add links to it from the nutch wiki. Would you consider

RE: Compilation Errors

2005-11-09 Thread Vanderdray, Jacob
forward. For what it's worth I get the same error if I use the source in the trunk instead of the 0.7 branch. Jake. -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 08, 2005 4:55 PM To: nutch-user@lucene.apache.org Subject: Compilation Errors

Compilation Errors

2005-11-08 Thread Vanderdray, Jacob
I just used svn to grab the nutch-0.7 branch and I get the error below when I try to use ant. This is on a Mac (OS 10.4.3). I'm guessing I just need to uncomment some defaults, but any pointers would be appreciated. Thanks, Jake. compile-core: [javac] Compiling 247 source files to