Clustering plugin descriptor broken (fix included)
--
Key: NUTCH-228
URL: http://issues.apache.org/jira/browse/NUTCH-228
Project: Nutch
Type: Bug
Reporter: Dawid Weiss
Priority: Minor
The plugin descriptor
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ]
Dawid Weiss updated NUTCH-228:
--
Attachment: clustering.patch
This patch fixes the plugin descriptor and a typo in cluster.jsp that caused
the wrong number of milliseconds to be dumped in the output
Jack Tang wrote:
Hi all,
RegExp is widely used in Nutch, and I am now wondering whether the JDK/Jakarta
classes are fast enough.
Here are the benchmarks I found on the web:
http://tusker.org/regex/regex_benchmark.html
It seems dk.brics.automaton.RegExp is the fastest among the libs.
It's not only faster,
Dawid Weiss wrote:
It seems to me that there are two separate problems:
1) content parsing to avoid site structure - influences the index and
rankings
2) content parsing for KWIC snippet generation - influences the user
perception of the engine.
I'd agree that (2) is quite important for
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ]
Jerome Charron closed NUTCH-228:
Fix Version: 0.8-dev
Resolution: Fixed
Committed :
* http://svn.apache.org/viewcvs.cgi?rev=385267&view=rev
*
[ http://issues.apache.org/jira/browse/NUTCH-217?page=all ]
Jerome Charron resolved NUTCH-217:
--
Resolution: Fixed
Fixed : http://svn.apache.org/viewcvs.cgi?view=rev&rev=384011
Thanks Dawid.
InstantiationException when deserializing Query (no
It's not only faster, it also scales better for large and complex
expressions, it is also possible to build automata from several
expressions with AND/OR operators, which is the use case we have in
regexp-urlfilter.
It seems awesome!
Does anybody plan to switch to this lib in Nutch?
Does
Hmm... I'm not convinced. How would you generate the best snippet from a
relevant, but ignored chunk?
Good point... I guess you simply wouldn't generate anything at all (show
the title?). I guess structure text should not be relevant enough to
actually cause a hit on top of the search
Jérôme Charron wrote:
It's not only faster, it also scales better for large and complex
expressions, it is also possible to build automata from several
expressions with AND/OR operators, which is the use case we have in
regexp-urlfilter.
It seems awesome!
I forgot to add: it is also
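To make the "several expressions with OR" use case concrete, here is a minimal sketch of a URL filter that joins a few rules into one pattern. It deliberately uses java.util.regex rather than dk.brics.automaton, so it shows only the shape of the use case, not the automaton library's API or its performance. The class name, the method names and the three sample rules are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: not the Nutch plugin code.
class CombinedFilterSketch {

    // Join several expressions into a single alternation (logical OR).
    // Each expression is wrapped in a non-capturing group so that its own
    // '|' operators do not leak into the surrounding alternation.
    static Pattern anyOf(List<String> expressions) {
        StringBuilder sb = new StringBuilder();
        for (String e : expressions) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?:").append(e).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // A URL is rejected when any of the combined rules matches part of it.
    static boolean rejected(Pattern combined, String url) {
        return combined.matcher(url).find();
    }

    public static void main(String[] args) {
        // Assumed sample rules: image suffixes, query-like characters,
        // and a few non-HTTP schemes.
        Pattern combined = anyOf(Arrays.asList(
            "\\.(gif|jpg|png)$",
            "[?*!@=]",
            "^(file|ftp|mailto):"));

        System.out.println(rejected(combined, "http://example.org/logo.gif"));
        System.out.println(rejected(combined, "http://example.org/index.html"));
    }
}
```

With a backtracking engine such as java.util.regex, the alternatives are tried one after another at each position, which is one reason an automaton-based library may scale better as the rule list grows.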
Thanks for volunteering, you're welcome ... ;-)
Good job Andrzej !;-)
So it's now on my to-do list to check the Perl5 compatibility issue and to
provide some benchmarks to the community...
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I'd agree that (2) is quite important for the end user; Richard's
continuous text heuristic may actually work for that. I'd extend the
meaning of continuous block to ignore inline tags such as SPAN, I, B, TT
etc, so only certain tags would actually break the content into chunks.
Snippets then
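The chunking idea described above (inline tags such as SPAN, I, B, TT do not break a continuous block, other tags do) could be sketched roughly as follows. This is only an illustration under simple assumptions: it works on a raw HTML string with regular expressions instead of a real parser, and the class name, method name and tag list are hypothetical, not taken from Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: not the Nutch parser code.
class ChunkSketch {

    // Assumption: these inline tags never break a continuous block.
    // The \b keeps <b> from matching <br> or <body>, and <i> from <img>.
    private static final Pattern INLINE_TAG =
        Pattern.compile("(?i)</?(span|i|b|tt)\\b[^>]*>");

    // Every other tag is treated as a chunk boundary.
    private static final Pattern OTHER_TAG = Pattern.compile("<[^>]+>");

    static List<String> chunks(String html) {
        // Step 1: drop inline tags so the text around them stays joined.
        String withoutInline = INLINE_TAG.matcher(html).replaceAll("");

        // Step 2: split at the remaining tags and keep non-empty text.
        List<String> result = new ArrayList<>();
        for (String piece : OTHER_TAG.split(withoutInline)) {
            String text = piece.trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div>Home | About</div>"
            + "<p>Nutch is an <b>open source</b> search <i>engine</i>.</p>";
        // The navigation text and the paragraph end up in separate chunks.
        for (String chunk : chunks(html)) {
            System.out.println(chunk);
        }
    }
}
```

A snippet generator could then pick chunks by length or by number of query terms; that selection step is left out here.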
improved handling of plugin folder configuration
Key: NUTCH-229
URL: http://issues.apache.org/jira/browse/NUTCH-229
Project: Nutch
Type: Improvement
Reporter: Stefan Groschupf
Priority: Critical
Fix For:
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ]
Stefan Groschupf updated NUTCH-229:
---
Attachment: pluginFolder.patch
A patch that makes it possible to use relative paths that are not in the classpath.