Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. We will also need to estimate the total number of matches by extrapolating
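
The idea in this exchange (stop after the first n matches, then extrapolate the total) can be illustrated with a small Python sketch. It is not Lucene's HitCollector API and not the code under discussion; `collect_first_n`, `postings` and `matches` are hypothetical names, and the extrapolation assumes matches are spread evenly over a posting list whose best documents come first.

```python
def collect_first_n(postings, matches, n):
    """Scan `postings` in order, stop after `n` matches, extrapolate the total.

    postings: list of doc ids, assumed already sorted best-first.
    matches:  predicate deciding whether a doc id matches the query.
    Returns (hits, estimated_total).
    """
    hits = []
    scanned = 0
    for doc_id in postings:
        scanned += 1
        if matches(doc_id):
            hits.append(doc_id)
            if len(hits) == n:
                break  # early termination: the rest of the list is never read
    if scanned == 0:
        return hits, 0
    # Assumption: the match rate seen so far holds for the unread remainder.
    estimated_total = round(len(hits) * len(postings) / scanned)
    return hits, estimated_total


hits, estimate = collect_first_n(list(range(100)), lambda d: d % 4 == 0, 5)
```

When the scan reaches the end of the list before finding n matches, the estimate equals the exact count, because the whole list has been read.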

Hard-coded Content-type checks

2005-12-13 Thread Jérôme Charron
Hi, I would like to remove all the hard-coded content-type checks spread across the parse plugins. The content-type/plugin-id mapping is now centralized in the parse-plugins.xml file, so there is no longer any need for the parsers to check the content-type. The basic idea was: 1. The

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. Ok, I just tested IndexSorter for now. It appears

Re: Hard-coded Content-type checks

2005-12-13 Thread Stefan Groschupf
If there is no objection, I will commit these changes within the next few hours. +1!!! :-)

Re: Hard-coded Content-type checks

2005-12-13 Thread Andrzej Bialecki
Jérôme Charron wrote: If there is no objection, I will commit these changes within the next few hours. +1. Great stuff! Finally we will be able to predict which parser works on which content... -- Best regards, Andrzej Bialecki

Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and about the way that we store names in there. Currently, people are free to name their string-based properties anything they want, such as having names of Content-type, content-TyPe,
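
The naming problem raised here can be illustrated with a minimal Python sketch: property names are normalized on the way in and out, and callers share one constant instead of spelling the name themselves. This is not the Nutch ParseData or Content API; `CaseInsensitiveMetadata` and `CONTENT_TYPE` are hypothetical names chosen for the example.

```python
# Hypothetical shared constant; callers use it instead of typing the name.
CONTENT_TYPE = "Content-Type"


class CaseInsensitiveMetadata:
    """String properties whose names compare without regard to case."""

    def __init__(self):
        self._values = {}  # lower-cased property name -> value

    def set(self, name, value):
        self._values[name.lower()] = value

    def get(self, name, default=None):
        return self._values.get(name.lower(), default)


meta = CaseInsensitiveMetadata()
meta.set("content-TyPe", "text/html")  # written with an ad hoc spelling
value = meta.get(CONTENT_TYPE)         # read through the shared constant
```

Normalizing the key makes lookups tolerant of spelling variants, while the shared constant keeps new code from introducing further variants.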

[Fwd: Crawler submits forms?]

2005-12-13 Thread Doug Cutting
FYI This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. Doug Original Message Subject: Crawler submits forms? Date: Tue, 13 Dec 2005 16:57:34 - From: Andy Read [EMAIL PROTECTED] Reply-To:

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Stefan Groschupf
+1! BTW, did you notice that Jerome committed a patch that makes Content metadata case-insensitive? Stefan On 13.12.2005 at 18:07, Chris Mattmann wrote: Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and about the way that we store

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Stefan Groschupf
This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. Maybe we can also get some more parser-selection-related issues fixed in the next few days and get them into a 0.7.2 release. I would be happy to see some more parser

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Stefan, Thanks. Yup, I noticed it and I think it will really help out a lot. Great job to both of you :-) Cheers, Chris On 12/13/05 10:59 AM, Stefan Groschupf [EMAIL PROTECTED] wrote: +1! BTW, did you notice that Jerome committed a patch that makes Content metadata case

best file system for NDFS?

2005-12-13 Thread Stefan Groschupf
Hi geeks, I do not have much deep knowledge about Unix file systems, so my question is: what would be the best file system for Nutch distributed file system data nodes? Does it make any difference which file system is used? Would ReiserFS be a good choice? Thanks for any

Idea about aliases in the parse-plugins.xml file

2005-12-13 Thread Chris Mattmann
Hi Folks, Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType-list of pluginIds rather than mimeType-list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would
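
One way to picture the alias idea is a two-step lookup: a mime type maps to a list of plugin ids, and an alias table maps each plugin id to an extension id. The Python sketch below is only an illustration under that assumption. It does not reproduce the proposed parse-plugins.xml format, and every id in it (`plugin-a`, `extension-a`, and so on) is a hypothetical placeholder.

```python
# Hypothetical data standing in for what a config file might declare.
MIME_TO_PLUGINS = {
    "text/html": ["plugin-a"],
    "application/rss+xml": ["plugin-b", "plugin-c"],
}

# Hypothetical alias table: plugin id -> extension id.
ALIASES = {
    "plugin-a": "extension-a",
    "plugin-b": "extension-b",
}


def extensions_for(mime_type):
    """Return the extension ids to try, in order, for a mime type.

    A plugin id without an alias falls back to the plugin id itself
    (an assumption made for this sketch).
    """
    plugin_ids = MIME_TO_PLUGINS.get(mime_type, [])
    return [ALIASES.get(plugin_id, plugin_id) for plugin_id in plugin_ids]


rss_extensions = extensions_for("application/rss+xml")
```

Keeping the alias table separate from the mime-type mapping means the mapping can keep listing plugin ids while the lookup still resolves to extension ids.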

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
+1 A simple solution that provides a standard way to access common meta data. Great! -- http://motrech.free.fr/ http://www.frutch.org/

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Andrzej Bialecki
Stefan Groschupf wrote: +1! BTW, did you notice that Jerome committed a patch that makes Content metadata case-insensitive? I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties that are defined there - just prepend them with

Re: best file system for NDFS?

2005-12-13 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi geeks, I do not have much deep knowledge about Unix file systems, so my question is: what would be the best file system for Nutch distributed file system data nodes? Does it make any difference which file system is used? Would ReiserFS be a

Re: best file system for NDFS?

2005-12-13 Thread Rod Taylor
On Tue, 2005-12-13 at 21:43 +0100, Andrzej Bialecki wrote: Most of the time we deal with very large files, with sequential access. Only in a few places do we deal with a lot of small files (e.g. indexing). So, I think the best would be an FS optimized for efficient sequential write/read of

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Guys, Okay, that makes sense then. I will create an issue in JIRA later today describing the update, and then begin working on this over the next few days. Thanks for your responses and reviews. Cheers, Chris On 12/13/05 12:45 PM, Jérôme Charron [EMAIL PROTECTED] wrote: I agree, too.

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Jérôme Charron
+1 for a 0.7.2 release. Here are the issues/revisions I can merge to the 0.7 branch. These changes mainly concern the parser-factory changes (NUTCH-88) http://issues.apache.org/jira/browse/NUTCH-112 http://issues.apache.org/jira/browse/NUTCH-135 http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Andrzej Bialecki
Jérôme Charron wrote: +1 for a 0.7.2 release. +1. Things are going well on the mapred branch, all basic tools are almost in place, so after this release we will probably start merging... so, this looks like the last release of the 0.7.x line (from the code in trunk/ - I'm sure there

[jira] Created: (NUTCH-137) footer is not displayed in search result page

2005-12-13 Thread KuroSaka TeruHiko (JIRA)
footer is not displayed in search result page - Key: NUTCH-137 URL: http://issues.apache.org/jira/browse/NUTCH-137 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment: Windows XP, Japanese

[jira] Created: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2005-12-13 Thread KuroSaka TeruHiko (JIRA)
non-Latin-1 characters cannot be submitted for search - Key: NUTCH-138 URL: http://issues.apache.org/jira/browse/NUTCH-138 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment:

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360389 ] Chris A. Mattmann commented on NUTCH-139: - According to Andrzej: I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Chris A. Mattmann updated NUTCH-139: Priority: Minor (was: Major) Standard metadata property names in the ParseData metadata --

[jira] Created: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping

2005-12-13 Thread Chris A. Mattmann (JIRA)
Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping Key: NUTCH-140 URL: http://issues.apache.org/jira/browse/NUTCH-140 Project: Nutch Type:

problem in merging index

2005-12-13 Thread Rozina Sorathia
I have a separate application which uses the Lucene APIs to create an index. When I try to merge this index with the Nutch index (that is, with one of the index folders present in the nutch-segments folder) using the API addIndexes(Directory[]), I get an exception saying that some .f1