Re: best file system for NDFS?

2005-12-13 Thread Leen Toelen
I would say the same. I don't think anyone can predict what will happen, so I suggest someone run some tests with different filesystems AND different block sizes etc. Results will probably differ on different hardware as well. Regards, Leen Toelen On 12/13/05, Andrzej Bialecki <[EMAIL PROTE

Timeout that does not retry

2005-12-13 Thread Rod Taylor
Every once in a while I come across one of these types of timeouts. They do not cancel the job nor do they seem to retry the task -- they appear to just sit waiting for someone to manually remove it from the jobtracker. task_r_zgsc0j 1.0 reduce > reduce task_r_o97cc4 0.5 reduce > sort Timed

problem in merging index

2005-12-13 Thread Rozina Sorathia
I have a separate application which uses Lucene APIs for creating an index. Now when I try to merge this index with the Nutch index, that is, with one of the index folders present in the nutch-segments folder, using the API addIndexes(Directory[]), I get an exception saying that some .f1 f

[jira] Created: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping

2005-12-13 Thread Chris A. Mattmann (JIRA)
Add alias capability in parse-plugins.xml file that allows mimeType->extensionId mapping Key: NUTCH-140 URL: http://issues.apache.org/jira/browse/NUTCH-140 Project: Nutch Type:

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Chris A. Mattmann updated NUTCH-139: Priority: Minor (was: Major) > Standard metadata property names in the ParseData metadata > -- >

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360389 ] Chris A. Mattmann commented on NUTCH-139: - According to Andrzej: "I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties

[jira] Created: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris A. Mattmann (JIRA)
Standard metadata property names in the ParseData metadata --- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more complete version, still mostly

[jira] Created: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2005-12-13 Thread KuroSaka TeruHiko (JIRA)
non-Latin-1 characters cannot be submitted for search - Key: NUTCH-138 URL: http://issues.apache.org/jira/browse/NUTCH-138 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment: Windo

[jira] Created: (NUTCH-137) footer is not displayed in search result page

2005-12-13 Thread KuroSaka TeruHiko (JIRA)
footer is not displayed in search result page - Key: NUTCH-137 URL: http://issues.apache.org/jira/browse/NUTCH-137 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment: Windows XP, Japanese

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Andrzej Bialecki
Jérôme Charron wrote: +1 for a 0.7.2 release. +1. Things are going well on the mapred branch, all basic tools are almost in place, so after this release we will probably start merging... so, this looks like the last release of the 0.7.x line (from the code in trunk/ - I'm sure there wil

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Jérôme Charron
+1 for a 0.7.2 release. Here are the issues/revisions I can merge to 0.7 branch. These changes mainly concern the parser-factory changes (NUTCH-88) http://issues.apache.org/jira/browse/NUTCH-112 http://issues.apache.org/jira/browse/NUTCH-135 http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev ht

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Guys, Okay, that makes sense then. I will create an issue in JIRA later today describing the update, and then begin working on this over the next few days. Thanks for your responses and reviews. Cheers, Chris On 12/13/05 12:45 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote: >> I agree,

Re: best file system for NDFS?

2005-12-13 Thread Rod Taylor
On Tue, 2005-12-13 at 21:43 +0100, Andrzej Bialecki wrote: > > Most of the time we deal with very large files, with sequential > access. > Only in few places we deal with a lot of small files (e.g. indexing). > So, I think the best would be an FS optimized for efficient > sequential > write/rea

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
> I agree, too. Perhaps we should use the names as they appear in the > Dublin Core for those properties that are defined there A big YES! > - just prepended > them with "X-nutch-" in order to avoid name-clashes with other > properties (e.g. blindly copied from the protocol headers). Another bi
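A minimal sketch of the naming scheme quoted above, assuming property names are plain strings. Only the "X-nutch-" prefix and the idea of reusing Dublin Core names come from the thread; the class `MetadataNamesSketch` and the helper `nutchName` are hypothetical and not part of Nutch.

```java
final class MetadataNamesSketch {
    // Prefix proposed in the thread to avoid clashes with other properties,
    // e.g. names copied from protocol headers.
    static final String PREFIX = "X-nutch-";

    // Hypothetical helper: turn a Dublin Core element name into a prefixed key.
    static String nutchName(String dublinCoreName) {
        return PREFIX + dublinCoreName;
    }

    public static void main(String[] args) {
        // "title" and "language" are Dublin Core element names.
        System.out.println(nutchName("title"));
        System.out.println(nutchName("language"));
    }
}
```

Keeping the prefix in one constant means a later change to the convention would touch a single line rather than every plugin.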

Re: best file system for NDFS?

2005-12-13 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi geeks, I do not have much deep knowledge of Unix file systems, so my question is: what would be the best file system for Nutch Distributed File System data nodes? Does it make any difference whether one file system or another is used? Would ReiserFS be a good

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Andrzej Bialecki
Stefan Groschupf wrote: +1! BTW, did you notice that Jerome committed a patch that makes Content meta data now case insensitive? I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties that are defined there - just prepended them with "X-nutch

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
+1 A simple solution that provides a standard way to access common meta data. Great! -- http://motrech.free.fr/ http://www.frutch.org/

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Piotr Kosiorowski
If we are going to make a 0.7.2 release, I would like to commit a patch for http://issues.apache.org/jira/browse/NUTCH-112 and probably for some build problems people are reporting (missing src folder in nutch-extension plugin). I will look at them in the next few days. Regards Piotr Stefan Groschupf w

Idea about aliases in the parse-plugins.xml file

2005-12-13 Thread Chris Mattmann
Hi Folks, Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType->list of pluginIds rather than mimeType->list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would seem

best file system for NDFS?

2005-12-13 Thread Stefan Groschupf
Hi geeks, I do not have much deep knowledge of Unix file systems, so my question is: what would be the best file system for Nutch Distributed File System data nodes? Does it make any difference whether one file system or another is used? Would ReiserFS be a good choice? Thanks for any c

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Stefan, Thanks. Yup, I noticed it and I think it will really help out a lot. Great job to the both of you :-) Cheers, Chris On 12/13/05 10:59 AM, "Stefan Groschupf" <[EMAIL PROTECTED]> wrote: > +1! > BTW, did you notice that Jerome committed a patch that makes Content > meta data now c

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Stefan Groschupf
This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. Maybe we can also get some more parser-selection-related issues fixed in the next few days and get them into a 0.7.2 release. I would be happy to see some more parser selec

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Stefan Groschupf
+1! BTW, did you notice that Jerome committed a patch that makes Content meta data now case insensitive? Stefan On 13.12.2005 at 18:07, Chris Mattmann wrote: Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store

[Fwd: Crawler submits forms?]

2005-12-13 Thread Doug Cutting
FYI This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. Doug Original Message Subject: Crawler submits forms? Date: Tue, 13 Dec 2005 16:57:34 - From: Andy Read <[EMAIL PROTECTED]> Reply-To: nutc

Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT
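To illustrate the naming problem described in this message, here is a small sketch of a case-insensitive property store. It is an assumption about one possible approach, not the patch that was committed to Nutch; the class `CaseInsensitiveMetadataSketch` is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class CaseInsensitiveMetadataSketch {
    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats
    // "Content-type" and "content-TyPe" as the same key.
    private final Map<String, String> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        CaseInsensitiveMetadataSketch meta = new CaseInsensitiveMetadataSketch();
        meta.set("Content-type", "text/html");
        meta.set("content-TyPe", "text/plain"); // overwrites the entry above
        System.out.println(meta.size());
        System.out.println(meta.get("CONTENT-TYPE"));
    }
}
```

With a plain java.util.Properties object, the two `set` calls would create two separate entries, which is the inconsistency the message raises.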

Re: Hard-coded Content-type checks

2005-12-13 Thread Andrzej Bialecki
Jérôme Charron wrote: If there is no objection, I will commit these changes in the next hours. +1. Great stuff! Finally we will be able to predict which parser works on which content... -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [_

Re: Hard-coded Content-type checks

2005-12-13 Thread Stefan Groschupf
If there is no objection, I will commit these changes in the next hours. + 1!!! :-)

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. Ok, I just tested IndexSorter for now. It appears t

Hard-coded Content-type checks

2005-12-13 Thread Jérôme Charron
Hi, I would like to remove all the hard-coded content-type checks spread over all the parse plugins. In fact, the content-type/plugin-id mapping is now centralized in the parse-plugins.xml file, and there is no longer any need for the parser to check the content-type. The basic idea was: 1. The developer
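A rough sketch of what a centralized content-type to plugin-id mapping could look like in memory. This does not reflect the actual Nutch parser factory or the format of the parse-plugins.xml file; the class `ParsePluginMapSketch`, its methods, and the plugin ids used in `main` are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ParsePluginMapSketch {
    // Content type -> ordered list of plugin ids to try.
    private final Map<String, List<String>> byContentType = new HashMap<>();

    void register(String contentType, String... pluginIds) {
        byContentType.put(contentType.toLowerCase(Locale.ROOT), Arrays.asList(pluginIds));
    }

    // The caller picks the parser from this list, so an individual parser
    // does not need its own hard-coded content-type check.
    List<String> pluginsFor(String contentType) {
        return byContentType.getOrDefault(
                contentType.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        ParsePluginMapSketch map = new ParsePluginMapSketch();
        map.register("text/html", "parse-html", "parse-text"); // illustrative ids
        System.out.println(map.pluginsFor("TEXT/HTML"));
        System.out.println(map.pluginsFor("application/pdf"));
    }
}
```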

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. We will also need to estimate the total number of matches by extrapolating li
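The idea of stopping after the first n matches and estimating the total can be sketched without Lucene as follows. This is an illustration only: it assumes documents are already ordered so that the best ones come first (as a sorted index would provide), and it assumes a simple proportional extrapolation, since the message is cut off before the method is named. `FirstNSketch` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class FirstNSketch {
    static final class Result {
        final List<Integer> docs;   // ids of the first n matching documents
        final long estimatedTotal;  // extrapolated total number of matches

        Result(List<Integer> docs, long estimatedTotal) {
            this.docs = docs;
            this.estimatedTotal = estimatedTotal;
        }
    }

    // Scan document ids in index order and stop once n matches are collected.
    static Result collectFirstN(int maxDoc, int n, IntPredicate matches) {
        List<Integer> docs = new ArrayList<>();
        int scanned = 0;
        for (int doc = 0; doc < maxDoc && docs.size() < n; doc++) {
            scanned++;
            if (matches.test(doc)) {
                docs.add(doc);
            }
        }
        // Assumed estimate: the match rate seen so far holds for the whole index.
        long estimate = scanned == 0
                ? 0
                : Math.round((double) docs.size() * maxDoc / scanned);
        return new Result(docs, estimate);
    }

    public static void main(String[] args) {
        // Toy index of 1000 documents where every fifth document matches.
        Result r = collectFirstN(1000, 10, doc -> doc % 5 == 0);
        System.out.println(r.docs);
        System.out.println(r.estimatedTotal);
    }
}
```

In the toy run the scan stops after 46 documents and the estimate comes out at 217 rather than the true 200, which shows that the extrapolated count is approximate.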