Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting list...
Yes. I was just posting the work-in-progress.
We will also need to estimate the total number of matches by
extrapolating
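
A minimal sketch of the idea discussed above: stop scanning once the first n matches are collected, then estimate the total. The message is cut off after "extrapolating", so the linear extrapolation used here is an assumption, as are the function and parameter names. It also assumes the posting list is already ordered best-first, which is what a sorted index would be expected to provide.

```python
def collect_first_n(postings, matches, n):
    """Scan a posting list in index order and stop after n hits.

    postings: doc ids, assumed to be ordered best-first
    matches:  predicate that says whether a doc matches the query
    Returns (hits, estimated_total).
    """
    hits = []
    scanned = 0
    for doc in postings:
        scanned += 1
        if matches(doc):
            hits.append(doc)
            if len(hits) == n:
                break  # early termination: the rest of the list is never read
    if scanned == 0:
        return hits, 0
    # Assumption: the match rate seen so far holds for the whole list.
    estimated_total = round(len(hits) * len(postings) / scanned)
    return hits, estimated_total


# Usage: every second document matches, and we only want the first five.
hits, estimated_total = collect_first_n(list(range(100)), lambda d: d % 2 == 0, 5)
```

The estimate is only as good as the assumption that matches are spread evenly; on a list sorted by score that may well not hold.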
Hi,
I would like to remove all the hard-coded content-type checks spread over
all the parse plugins.
In fact, the content-type/plugin-id mapping is now centralized in the
parse-plugins.xml file, and there is
no longer any need for the parsers to check the content-type.
The basic idea was:
1. The
Doug Cutting wrote:
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting
list...
Yes. I was just posting the work-in-progress.
Ok, I just tested IndexSorter for now. It appears
If there is no objection, I will commit these changes in the next
hours.
+ 1!!! :-)
Jérôme Charron wrote:
If there is no objection, I will commit these changes in the next hours.
+1. Great stuff! Finally we will be able to predict which parser works
on which content...
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__
Hi Folks,
I was just thinking about the ParseData java.util.Properties metadata object
and thinking about the way that we store names in there. Currently, people
are free to name their string-based properties anything that they want, such
as having names of Content-type, content-TyPe,
FYI
This has been fixed in the mapred branch, but that patch is not in
0.7.1. This alone might be a reason to make a 0.7.2 release.
Doug
Original Message
Subject: Crawler submits forms?
Date: Tue, 13 Dec 2005 16:57:34 -
From: Andy Read [EMAIL PROTECTED]
Reply-To:
+1!
BTW, did you notice that Jerome committed a patch that makes Content
meta data now case insensitive?
Stefan
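
For readers unfamiliar with the point, case-insensitive metadata names can be illustrated roughly as follows. This is a hypothetical stand-in written for this note, not the committed patch: the class name and its methods are assumptions.

```python
class CaseInsensitiveMetadata:
    """Hypothetical property bag whose names ignore case."""

    def __init__(self):
        self._values = {}  # lower-cased name -> value

    def set(self, name, value):
        # Normalize the name once on write so that every spelling
        # of the same name lands on the same key.
        self._values[name.lower()] = value

    def get(self, name, default=None):
        return self._values.get(name.lower(), default)


meta = CaseInsensitiveMetadata()
meta.set("Content-Type", "text/html")
value = meta.get("content-TyPe")  # same entry, different spelling
```

Normalizing on write keeps lookups cheap, at the cost of losing the original spelling of the name.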
On 13.12.2005 at 18:07, Chris Mattmann wrote:
Hi Folks,
I was just thinking about the ParseData java.util.Properties
metadata object
and thinking about the way that we store
This has been fixed in the mapred branch, but that patch is not in
0.7.1. This alone might be a reason to make a 0.7.2 release.
Maybe we can also get some more parser-selection-related issues fixed in
the next few days and get them into a 0.7.2 release.
I would be happy to see some more parser
Hi Stefan,
Thanks. Yup, I noticed it and I think it will really help out a lot. Great
job to the both of you :-)
Cheers,
Chris
On 12/13/05 10:59 AM, Stefan Groschupf [EMAIL PROTECTED] wrote:
+1!
BTW, did you notice that Jerome committed a patch that makes Content
meta data now case
Hi geeks,
I don't have very deep knowledge of Unix file systems,
so my question is: what would be the best file system for Nutch
distributed file system data nodes?
Does it make any difference whether we use one file system or another?
Would ReiserFS be a good choice?
Thanks for any
Hi Folks,
Jerome and I have been talking about an idea to address the current issue
raised by Stefan G. about having a mapping of mimeType -> list of pluginIds
rather than mimeType -> list of extensionIds in the parse-plugins.xml file.
We've come up with the following proposed update that would
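
To make the mapping under discussion concrete, here is a rough sketch of a mimeType -> list of pluginIds lookup. The XML fragment is illustrative only: its element and attribute names are assumptions for this example, and the proposal itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: element and attribute names are assumptions.
SAMPLE = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html"/>
    <plugin id="parse-text"/>
  </mimeType>
  <mimeType name="application/pdf">
    <plugin id="parse-pdf"/>
  </mimeType>
</parse-plugins>
"""


def load_mapping(xml_text):
    """Return {mimeType: [pluginId, ...]}, keeping the order used in the file."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for mime in root.findall("mimeType"):
        mapping[mime.get("name")] = [p.get("id") for p in mime.findall("plugin")]
    return mapping


def plugins_for(mapping, content_type):
    """Ordered plugin ids for a content type, or an empty list if unmapped."""
    return mapping.get(content_type, [])


mapping = load_mapping(SAMPLE)
html_plugins = plugins_for(mapping, "text/html")
```

Keeping the plugin ids in file order lets the first entry act as the preferred parser and the rest as fallbacks, which is one possible reading of "list of pluginIds".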
+1
A simple solution that provides a standard way to access common meta data.
Great!
--
http://motrech.free.fr/
http://www.frutch.org/
Stefan Groschupf wrote:
+1!
BTW, did you notice that Jerome committed a patch that makes Content
meta data now case insensitive?
I agree, too. Perhaps we should use the names as they appear in the
Dublin Core for those properties that are defined there - just prepend
them with
Stefan Groschupf wrote:
Hi geeks,
I don't have very deep knowledge of Unix file systems,
so my question is: what would be the best file system for Nutch
distributed file system data nodes?
Does it make any difference whether we use one file system or another?
Would ReiserFS be a
On Tue, 2005-12-13 at 21:43 +0100, Andrzej Bialecki wrote:
Most of the time we deal with very large files, with sequential
access.
Only in a few places do we deal with a lot of small files (e.g. indexing).
So, I think the best would be an FS optimized for efficient
sequential
write/read of
Hi Guys,
Okay, that makes sense then. I will create an issue in JIRA later today
describing the update, and then begin working on this over the next few
days.
Thanks for your responses and reviews.
Cheers,
Chris
On 12/13/05 12:45 PM, Jérôme Charron [EMAIL PROTECTED] wrote:
I agree, too.
+1 for a 0.7.2 release.
Here are the issues/revisions I can merge to 0.7 branch.
These changes mainly concern the parser-factory changes (NUTCH-88)
http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev
Jérôme Charron wrote:
+1 for a 0.7.2 release.
+1.
Things are going well on the mapred branch, all basic tools are almost
in place, so after this release we will probably start merging... so,
this looks like the last release of the 0.7.x line (from the code in
trunk/ - I'm sure there
footer is not displayed in search result page
-
Key: NUTCH-137
URL: http://issues.apache.org/jira/browse/NUTCH-137
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.7.1
Environment: Windows XP, Japanese
non-Latin-1 characters cannot be submitted for search
-
Key: NUTCH-138
URL: http://issues.apache.org/jira/browse/NUTCH-138
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.7.1
Environment:
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360389 ]
Chris A. Mattmann commented on NUTCH-139:
-
According to Andrzej:
I agree, too. Perhaps we should use the names as they appear in the Dublin
Core for those properties
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Chris A. Mattmann updated NUTCH-139:
Priority: Minor (was: Major)
Standard metadata property names in the ParseData metadata
--
Add alias capability in parse-plugins.xml file that allows
mimeType -> extensionId mapping
Key: NUTCH-140
URL: http://issues.apache.org/jira/browse/NUTCH-140
Project: Nutch
Type:
I have a separate application which uses
lucene APIs for creating an index.
Now when I try to merge this index with
the nutch index that is with one of the index folder present in the
nutch-segments folder
Using the API addIndexes(Directory[]) , I get
an exception saying that some .f1