[jira] Created: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Dawid Weiss (JIRA)
Clustering plugin descriptor broken (fix included) -- Key: NUTCH-228 URL: http://issues.apache.org/jira/browse/NUTCH-228 Project: Nutch Type: Bug Reporter: Dawid Weiss Priority: Minor The plugin descriptor

[jira] Updated: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ] Dawid Weiss updated NUTCH-228: -- Attachment: clustering.patch This patch fixes the plugin descriptor and a typo in cluster.jsp that caused the wrong number of milliseconds to be dumped in the output

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Andrzej Bialecki
Jack Tang wrote: Hi all, RegExp is widely used in nutch, and I am now wondering whether the jdk/jakarta classes are fast enough. Here are the benchmarks I found on the web: http://tusker.org/regex/regex_benchmark.html It seems dk.brics.automaton.RegExp is the fastest among the libs. It's not only faster,

Re: quality of search text

2006-03-12 Thread Andrzej Bialecki
Dawid Weiss wrote: It seems to me that there are two separate problems: 1) content parsing to avoid site structure - influences the index and rankings 2) content parsing for KWIC snippet generation - influences the user perception of the engine. I'd agree that (2) is quite important for

[jira] Closed: (NUTCH-228) Clustering plugin descriptor broken (fix included)

2006-03-12 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ] Jerome Charron closed NUTCH-228: Fix Version: 0.8-dev Resolution: Fixed Committed: * http://svn.apache.org/viewcvs.cgi?rev=385267&view=rev *

[jira] Resolved: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)

2006-03-12 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-217?page=all ] Jerome Charron resolved NUTCH-217: -- Resolution: Fixed Fixed: http://svn.apache.org/viewcvs.cgi?view=rev&rev=384011 Thanks Dawid. InstantiationException when deserializing Query (no

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
It's not only faster, it also scales better for large and complex expressions, it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter. It seems awesome! Does somebody plan to switch to this lib in nutch? Does

Re: quality of search text

2006-03-12 Thread Dawid Weiss
Hmm... I'm not convinced. How would you generate the best snippet from a relevant, but ignored chunk? Good point... I guess you simply wouldn't generate anything at all (show the title?). I guess structure text should not be relevant enough to actually cause a hit on top of the search

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Andrzej Bialecki
Jérôme Charron wrote: It's not only faster, it also scales better for large and complex expressions, it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter. It seems awesome! I forgot to add: it is also

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
Thanks for volunteering, you're welcome ... ;-) Good job Andrzej! ;-) So, it's now on my todo list to check the perl5 compatibility issue and to provide some benchmarks to the community... Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: quality of search text

2006-03-12 Thread Howie Wang
I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of continuous block to ignore inline tags such as SPAN, I, B, TT etc, so only certain tags would actually break the content into chunks. Snippets then
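
The chunking idea quoted above (inline tags such as SPAN, I, B, TT do not break a block; other tags do) can be sketched roughly as follows. This is only an illustration in Python using the standard html.parser module, not code from Nutch or from the thread. The INLINE_TAGS set, the ChunkExtractor class and the split_into_chunks function are hypothetical names chosen for this sketch, and the exact list of inline tags is an assumption.

```python
from html.parser import HTMLParser

# Assumption: tags treated as inline; they never break a text block.
# The thread names SPAN, I, B, TT "etc"; the remaining entries are guesses.
INLINE_TAGS = {"span", "i", "b", "tt", "em", "strong", "a"}


class ChunkExtractor(HTMLParser):
    """Collects text runs, starting a new chunk at every non-inline tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._current = []

    def flush(self):
        # Join buffered text, normalise whitespace, keep non-empty chunks.
        text = " ".join("".join(self._current).split())
        if text:
            self.chunks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        self._current.append(data)


def split_into_chunks(html):
    """Return the continuous text blocks found in an HTML fragment."""
    parser = ChunkExtractor()
    parser.feed(html)
    parser.close()   # delivers any text still buffered by the parser
    parser.flush()   # emit a trailing chunk that no tag closed
    return parser.chunks


if __name__ == "__main__":
    page = ("<div><p>Nutch is a <b>search</b> engine.</p>"
            "<ul><li>Home</li><li>About</li></ul></div>")
    for chunk in split_into_chunks(page):
        print(chunk)
```

In this sketch the sentence inside the P element stays in one piece despite the B tag, while each LI becomes its own short chunk. How such chunks would then be ranked or filtered is left open here.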

[jira] Created: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
improved handling of plugin folder configuration Key: NUTCH-229 URL: http://issues.apache.org/jira/browse/NUTCH-229 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For:

[jira] Updated: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ] Stefan Groschupf updated NUTCH-229: --- Attachment: pluginFolder.patch A patch that makes it possible to use relative paths that are not in the classpath. improved handling of plugin folder configuration