I would say the same. I don't think anyone can predict what will
happen, so I suggest someone runs some tests with different
filesystems AND different block sizes etc. Results will probably
differ on different hardware as well.
Regards,
Leen Toelen
On 12/13/05, Andrzej Bialecki <[EMAIL PROTE
Every once in a while I come across one of these types of timeouts. They
do not cancel the job nor do they seem to retry the task -- they appear
to just sit waiting for someone to manually remove it from the
jobtracker.
task_r_zgsc0j 1.0 reduce > reduce
task_r_o97cc4 0.5 reduce > sort Timed
I have a separate application which uses
the Lucene APIs for creating an index.
When I try to merge this index with
the Nutch index (that is, with one of the index folders present in the
nutch-segments folder)
using the API addIndexes(Directory[]), I get
an exception saying that some .f1 f
Add alias capability in parse-plugins.xml file that allows
mimeType->extensionId mapping
Key: NUTCH-140
URL: http://issues.apache.org/jira/browse/NUTCH-140
Project: Nutch
Type:
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
Chris A. Mattmann updated NUTCH-139:
Priority: Minor (was: Major)
> Standard metadata property names in the ParseData metadata
> --
>
[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360389 ]
Chris A. Mattmann commented on NUTCH-139:
-
According to Andrzej:
"I agree, too. Perhaps we should use the names as they appear in the Dublin
Core for those properties
Standard metadata property names in the ParseData metadata
---
Key: NUTCH-139
URL: http://issues.apache.org/jira/browse/NUTCH-139
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.7.1, 0.
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly, at
least I get exactly the same results, with the same scores and the same
explanations, if I run the same queries on the original and on the
sorted index.
Here's a more complete version, still mostly
non-Latin-1 characters cannot be submitted for search
-
Key: NUTCH-138
URL: http://issues.apache.org/jira/browse/NUTCH-138
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.7.1
Environment: Windo
footer is not displayed in search result page
-
Key: NUTCH-137
URL: http://issues.apache.org/jira/browse/NUTCH-137
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.7.1
Environment: Windows XP, Japanese
Jérôme Charron wrote:
+1 for a 0.7.2 release.
+1.
Things are going well on the mapred branch, all basic tools are almost
in place, so after this release we will probably start merging... so,
this looks like the last release of the 0.7.x line (from the code in
trunk/ - I'm sure there wil
+1 for a 0.7.2 release.
Here are the issues/revisions I can merge to 0.7 branch.
These changes mainly concern the parser-factory changes (NUTCH-88)
http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev
ht
Hi Guys,
Okay, that makes sense then. I will create an issue in JIRA later today
describing the update, and then begin working on this over the next few
days.
Thanks for your responses and reviews.
Cheers,
Chris
On 12/13/05 12:45 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:
>> I agree,
On Tue, 2005-12-13 at 21:43 +0100, Andrzej Bialecki wrote:
>
> Most of the time we deal with very large files, with sequential
> access.
> Only in a few places do we deal with a lot of small files (e.g. indexing).
> So, I think the best would be an FS optimized for efficient
> sequential
> write/rea
> I agree, too. Perhaps we should use the names as they appear in the
> Dublin Core for those properties that are defined there
A big YES!
> - just prepended
> them with "X-nutch-" in order to avoid name-clashes with other
> properties (e.g. blindly copied from the protocol headers).
Another bi
Stefan Groschupf wrote:
Hi geeks,
I do not have much deep knowledge about the Unix file systems,
so my question is: what would be the best file system for Nutch
distributed file system data nodes?
Does it make any difference whether we use one file system or another?
Would reiserFS a good
Stefan Groschupf wrote:
+1!
BTW, did you notice that Jerome committed a patch that makes Content
meta data now case insensitive?
I agree, too. Perhaps we should use the names as they appear in the
Dublin Core for those properties that are defined there - just prepended
them with "X-nutch
+1
A simple solution that provides a standard way to access common meta data.
Great!
--
http://motrech.free.fr/
http://www.frutch.org/
If we are going to make a 0.7.2 release I would like to commit
a patch for http://issues.apache.org/jira/browse/NUTCH-112
and probably for some build problems people are reporting (missing src
folder in nutch-extension plugin).
I will look at them in the next few days.
Regards
Piotr
Stefan Groschupf w
Hi Folks,
Jerome and I have been talking about an idea to address the current issue
raised by Stefan G. about having a mapping of mimeType->list of pluginIds
rather than mimeType->list of extensionIds in the parse-plugins.xml file.
We've come up with the following proposed update that would seem
Hi geeks,
I do not have much deep knowledge about the Unix file systems,
so my question is: what would be the best file system for Nutch
distributed file system data nodes?
Does it make any difference whether we use one file system or another?
Would ReiserFS be a good choice?
Thanks for any c
Hi Stefan,
Thanks. Yup, I noticed it and I think it will really help out a lot. Great
job to the both of you :-)
Cheers,
Chris
On 12/13/05 10:59 AM, "Stefan Groschupf" <[EMAIL PROTECTED]> wrote:
> +1!
> BTW, did you notice that Jerome committed a patch that makes Content
> meta data now c
This has been fixed in the mapred branch, but that patch is not in
0.7.1. This alone might be a reason to make a 0.7.2 release.
Maybe we can also get some more parser-selection-related issues fixed in
the next few days and get them into a 0.7.2 release.
I would be happy to see some more parser selec
+1!
BTW, did you notice that Jerome committed a patch that makes Content
meta data now case insensitive?
Stefan
On 13.12.2005 at 18:07, Chris Mattmann wrote:
Hi Folks,
I was just thinking about the ParseData java.util.Properties
metadata object
and thinking about the way that we store
FYI
This has been fixed in the mapred branch, but that patch is not in
0.7.1. This alone might be a reason to make a 0.7.2 release.
Doug
Original Message
Subject: Crawler submits forms?
Date: Tue, 13 Dec 2005 16:57:34 -
From: Andy Read <[EMAIL PROTECTED]>
Reply-To: nutc
Hi Folks,
I was just thinking about the ParseData java.util.Properties metadata object
and thinking about the way that we store names in there. Currently, people
are free to name their string-based properties anything that they want, such
as having names of "Content-type", "content-TyPe", "CONTENT
Jérôme Charron wrote:
If there is no objection, I will commit these changes in the next few hours.
+1. Great stuff! Finally we will be able to predict which parser works
on which content...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[_
If there is no objection, I will commit these changes in the next
few hours.
+ 1!!! :-)
Doug Cutting wrote:
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting
list...
Yes. I was just posting the work-in-progress.
Ok, I just tested IndexSorter for now. It appears t
Hi,
I would like to remove all the hard-coded content-type checks spread over
all the parse plugins.
In fact, the content-type/plugin-id mapping is now centralized in the
parse-plugins.xml file, and there is no
longer any need for the parser to check the content-type.
The basic idea was:
1. The developer
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting list...
Yes. I was just posting the work-in-progress.
We will also need to estimate the total number of matches by
extrapolating li