[jira] Commented: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-227?page=comments#action_12369660 ] Andrzej Bialecki commented on NUTCH-227: - Isn't it so that QueryFilter (which is an interface) already extends Configurable? What seems to be missing in

[jira] Commented: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Marko Bauhardt (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-227?page=comments#action_12369665 ] Marko Bauhardt commented on NUTCH-227: -- Take a look at Extension.java, lines 151 to 154. Object object = extensionClazz.newInstance(); if(object instanceof
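The pattern referred to in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Extension.java: it assumes the truncated `instanceof` check tests for a Configurable-style interface, and `Conf`, `Configurable`, `BasicFilter`, and `newExtensionInstance` are hypothetical stand-ins invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigurableSketch {

    /** Hypothetical stand-in for a configuration object. */
    public static class Conf {
        private final Map<String, String> values = new HashMap<>();

        public void set(String key, String value) {
            values.put(key, value);
        }

        public String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    /** Hypothetical stand-in for a Configurable-style interface. */
    public interface Configurable {
        void setConf(Conf conf);

        Conf getConf();
    }

    /** Hypothetical plugin class that receives its configuration via setConf. */
    public static class BasicFilter implements Configurable {
        private Conf conf;

        public BasicFilter() {
        }

        @Override
        public void setConf(Conf conf) {
            this.conf = conf;
        }

        @Override
        public Conf getConf() {
            return conf;
        }
    }

    /**
     * Instantiates clazz through its no-argument constructor and, if the new
     * object is Configurable, hands it the configuration before returning it.
     */
    public static Object newExtensionInstance(Class<?> clazz, Conf conf) {
        try {
            Object object = clazz.getDeclaredConstructor().newInstance();
            if (object instanceof Configurable) {
                ((Configurable) object).setConf(conf);
            }
            return object;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("example.boost", "2.0"); // hypothetical property name
        BasicFilter filter = (BasicFilter) newExtensionInstance(BasicFilter.class, conf);
        System.out.println(filter.getConf().get("example.boost", "1.0"));
    }
}
```

Under this assumption, a filter never has to ask for its configuration: whoever instantiates the extension injects it, which would be consistent with the issue being closed as already working.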

[jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-227?page=all ] Jerome Charron closed NUTCH-227: Resolution: Fixed Oops.. sorry guys... and thanks for your prompt remarks. All is in fact OK. Basic Query Filter no more uses Configuration

Re: [jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Jérôme Charron
In fact, my first need was to be able to configure the boost for RawFieldQueryFilter. The idea is then to give to the user a better control of boost values by simply : * add a setBoost(float) method to RawFieldQueryFilter. * (add a setLowerCase(boolean) method to RawFieldQueryFilter) * Add some

Re: [jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Stefan Groschupf
Jérôme, +1 Having the chance to write query filters that allow more control in general would be very helpful. Stefan On 09.03.2006 at 18:35, Jérôme Charron wrote: In fact, my first need was to be able to configure the boost for RawFieldQueryFilter. The idea is then to give to the user a

RE: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Gal Nitzan
Actually there is a property in conf: generate.max.per.host So if you add a message in Generator.java at the appropriate place... you have what you wish... Gal -Original Message- From: Rod Taylor [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 7:28 PM To: Nutch Developer

Contributing

2006-03-09 Thread Vertical Search
Hello, I was wondering if anyone is willing to consider some changes to make Nutch more user friendly. I would like to get a general feeling of the code base, reviewing code and cleaning up shadow variables, etc. Is someone doing it already? I am willing to take some time to contribute. Are there

RE: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Rod Taylor
On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote: Actually there is a property in conf: generate.max.per.host That has proven to be problematic. foo.domain.com bar.domain.com baz.domain.com *** Repeat up to 4 Million times for some content generator sites *** Each of these gets a different
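The problem described here can be illustrated with a small sketch. It is not Nutch's Generator: `limit` and `naiveDomainOf` are hypothetical helpers, and the "last two labels" domain rule is a deliberately naive assumption that mishandles suffixes such as co.uk. The per-domain key is shown only for contrast and is not something proposed in the excerpt above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PerDomainLimitSketch {

    /** Extracts the lower-cased host name from an absolute URL. */
    public static String hostOf(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    /** Naive assumption: the registered domain is the last two labels of the host. */
    public static String naiveDomainOf(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /**
     * Keeps at most max URLs per key, where the key is either the full host
     * or the naive domain. Input order is preserved.
     */
    public static List<String> limit(List<String> urls, int max, boolean byDomain) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = hostOf(url);
            String key = byDomain ? naiveDomainOf(host) : host;
            int seen = counts.merge(key, 1, Integer::sum);
            if (seen <= max) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://foo.domain.com/",
                "http://bar.domain.com/",
                "http://baz.domain.com/");
        // Every subdomain is a distinct host, so a per-host cap of 1 keeps all three.
        System.out.println(limit(urls, 1, false).size());
        // Keyed by the naive domain, the same cap keeps only one.
        System.out.println(limit(urls, 1, true).size());
    }
}
```

With millions of generated subdomains, a per-host cap bounds each host but not the total taken from the site, which is the weakness being pointed out.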

Re: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Doug Cutting
Rod Taylor wrote: First is to allow for cleaning up. This consists of a new option to updatedb which can scrub the database of all URLs which no longer match URLFilter settings (regex-urlfilter.txt). This allows a change in the urlfilter to be reflected against Nutch's current dataset,
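The scrubbing idea quoted here might look roughly like the sketch below. It is a simplified stand-in, not Nutch's updatedb or URLFilter code: the rule syntax assumes one rule per line, prefixed with `+` (accept) or `-` (reject), with the first matching rule deciding and unmatched URLs dropped. `Rule`, `parseRules`, `accepts`, and `scrub` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

public class ScrubSketch {

    /** One filter rule: a regular expression plus an accept/reject flag. */
    public static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule lines; blank lines and lines starting with '#' are skipped. */
    public static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    public static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** Returns only the stored URLs that still pass the current rules. */
    public static List<String> scrub(Collection<String> dbUrls, List<Rule> rules) {
        List<String> kept = new ArrayList<>();
        for (String url : dbUrls) {
            if (accepts(rules, url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rule> rules = parseRules(List.of("# drop a generator site", "-spam\\.example", "+."));
        List<String> db = List.of("http://a.spam.example/x", "http://good.example/");
        System.out.println(scrub(db, rules));
    }
}
```

Applying the filter while the database is already being read, as the proposal suggests, would avoid a separate pass over the data.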

Site switched to branch-0.7.

2006-03-09 Thread Piotr Kosiorowski
Hi, I have updated site in 0.7 branch with latest trunk changes. I have added both tutorials to the site so people will be aware of differences. I have also committed DOAP file in 0.7 branch. Nutch Website uses branch-0.7 now. Piotr

Nutch 0.7.2

2006-03-09 Thread Piotr Kosiorowski
Hello, I would like to release nutch 0.7.2 in a week or two. Some serious bugfixes are already covered and I have a plan to fix one or two more. I found an email from Doug with title [Fwd: Crawler submits forms?] stating: This has been fixed in the mapred branch, but that patch is not in

[jira] Closed: (NUTCH-225) Changed the links to the tutorial to point to the wiki

2006-03-09 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-225?page=all ] Piotr Kosiorowski closed NUTCH-225: --- Resolution: Won't Fix I have just updated Nutch Web site. It contains now both tutorials (for 0.7 and 0.8). I have also added a note to each

Re: Tutorial

2006-03-09 Thread Piotr Kosiorowski
Oops, sorry for ignoring this discussion - I was looking for comments in JIRA and already committed the change before reading your discussion. My motivation is to have usable version of tutorial - as simple as it is possible to be versioned with the sources - only for historical purposes - if

Re: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Rod Taylor
First is to allow for cleaning up. This consists of a new option to updatedb which can scrub the database of all URLs which no longer match URLFilter settings (regex-urlfilter.txt). This allows a change in the urlfilter to be reflected against Nutch's current dataset, something I think

[jira] Closed: (NUTCH-91) empty encoding causes exception

2006-03-09 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-91?page=all ] Piotr Kosiorowski closed NUTCH-91: -- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed Committed with small extension. Thanks. empty encoding causes exception

RE: Tutorial

2006-03-09 Thread Vanderdray, Jacob
+1 If we go with that idea, then the one on the website should be the tutorial for the latest release with a link to the wiki for the dev version of the tutorial and a note explaining that tutorials for older versions come with the source. Jake. -Original Message- From:

Re: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Andrzej Bialecki
Rod Taylor wrote: Doing the actual expunging during updatedb is better than as a separate command for performance. As a periodic option (scrubbing content generation or abuse sites in my case) combining with updatedb will reduce the IO and CPU requirements. Updatedb already reads in the DB,

Re: Nutch 0.7.2

2006-03-09 Thread ogjunk-nutch
I'm still on 0.7*, and would welcome a new release. Otis - Original Message From: Piotr Kosiorowski [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, March 9, 2006 3:31:09 PM Subject: Nutch 0.7.2 Hello, I would like to release nutch 0.7.2 in a week or two. Some serious

Re: Nutch 0.7.2

2006-03-09 Thread Doug Cutting
Piotr Kosiorowski wrote: I found an email from Doug with title [Fwd: Crawler submits forms?] stating: This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. I just want to make sure it was fixed by svn commit: r348533