I did a little digging and it appears that lang ends
up being null (couldn't quite track down where lang
should have been set). Not sure if it is a proper
fix, but changing doc.getField(lang).stringValue()
to doc.get(lang), makes my little crawl complete.
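A minimal sketch of why the change above avoids the crash, assuming the usual Lucene behaviour: getField() returns null for a missing field, so chaining stringValue() onto it throws a NullPointerException, while get() simply returns null. SketchDocument and LangLookup are hypothetical stand-ins written for this illustration, not Nutch or Lucene classes, and the field name "lang" is assumed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Lucene-style document (NOT the real Lucene class).
class SketchDocument {
    private final Map<String, String> fields = new HashMap<>();

    void add(String name, String value) {
        fields.put(name, value);
    }

    // Returns the stored value, or null when the field was never set.
    String get(String name) {
        return fields.get(name);
    }
}

class LangLookup {
    // Reads the assumed "lang" field and substitutes a fallback instead of failing on null.
    static String languageOf(SketchDocument doc, String fallback) {
        String lang = doc.get("lang");
        return lang != null ? lang : fallback;
    }

    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument();
        System.out.println(languageOf(doc, "unknown")); // field missing
        doc.add("lang", "fr");
        System.out.println(languageOf(doc, "unknown")); // field present
    }
}
```

This only hides the symptom. If the language field is expected to be set on every document, the missing value still needs to be traced to where it should have been written.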
lang is null because you don't have
I agree it is important to have the NGramProfile.getSimilarity() method.
However, I think it is also important that it is consistent with the scoring
that LanguageIdentifier uses, even if LanguageIdentifier optimises the
implementation. Looking at the code I see that the two scoring
Tom,
I have created the NUTCH-86 issue to report the needed changes in the
LanguageIdentifier we discussed in this thread.
The issue is available at http://issues.apache.org/jira/browse/NUTCH-86
Regards
Jérôme
Hello, Doug!
I tried with the mapred branch, but I still get errors like this:
$./nutch ndfs -put ./test.txt /test.txt
=
050831 055936 Client connection to 192.168.0.170:9000: starting
050831 060245 Waiting to find target node
=
On namenode I see :
050831 055936
I see several instances of 'analySer' in comments/javadoc and some
variables. That should probably be changed to the American spelling,
'analyzer', for consistency's sake.
Corrected/Committed
(http://svn.apache.org/viewcvs.cgi?rev=265020&view=rev)
Regards
Jérôme
--
http://motrech.free.fr/
Some more errors (short selection from my logfile). Do we really have to
handle them all separately, or are there any functions/tools for this kind
of problem?
...can't parse erroneous date: 12.06.2005 22:02:54 GMT
...can't parse erroneous date: 14.07.2005 GMT
...can't parse erroneous date:
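One way to avoid a separate code path per date layout is to try a list of patterns in order and take the first that parses. The sketch below uses only java.text.SimpleDateFormat from the JDK. The class name LenientDateParser and the two patterns are assumptions chosen to match the two complete log lines above, not code from Nutch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not part of Nutch.
class LenientDateParser {
    // Illustrative patterns only; extend the list as new layouts show up in the logs.
    static final String[] PATTERNS = {
        "dd.MM.yyyy HH:mm:ss z",
        "dd.MM.yyyy z",
    };

    // Tries each pattern in order and returns the first successful parse,
    // or null when no pattern matches (for example, an empty header value).
    static Date parse(String text) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setLenient(false); // reject out-of-range values such as month 13
            format.setTimeZone(TimeZone.getTimeZone("GMT"));
            try {
                return format.parse(text);
            } catch (ParseException e) {
                // This pattern did not match; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("12.06.2005 22:02:54 GMT"));
        System.out.println(parse("14.07.2005 GMT"));
        System.out.println(parse(""));
    }
}
```

Returning null for unparseable input leaves it to the caller to decide whether to log the value or fall back to another timestamp.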
Apache Wiki wrote:
1. The SVN repository consists of the following areas:
a. '''trunk''' [ ... ]
a. '''Release-x.x''' branches [ ... ]
This should also mention tags, fixed versions of the code where no
development occurs.
I also would prefer that tag names and branch names are distinct,
Doug Cutting wrote:
Glancing at other Apache projects in subversion, I see that httpd uses
branch names like 2.2.x and tag names like 2.2.4. That's a little
cryptic. I propose that we use branch names like branch-2.4 and tag
names like release-2.4.1. What do folks think?
+1
In fact I
Doug Cutting wrote:
Currently we have three versions of nutch: trunk, 0.7 and mapred. This
increases the chances for conflicts. I would thus like to merge the
mapred branch into trunk soon. The soonest I could actually start this
is next week. Are there any objections?
Doug
+1
P.
Glancing at other Apache projects in subversion, I see that httpd uses
branch names like 2.2.x and tag names like 2.2.4. That's a little
cryptic. I propose that we use branch names like branch-2.4 and tag
names like release-2.4.1. What do folks think?
+1
Jérôme
--
http://motrech.free.fr/
I am a bit lost but just a quick check - shouldn't it also be committed
in Release-0.7 branch?
No, the analyzer extension-point is committed only in trunk.
It's a new feature, so I follow Committer's Rules (
http://wiki.apache.org/nutch/Committer's_Rules)
;-)
Regards
Jérôme
--
I assume that in most NDFS-based configurations the production search
system will not run out of NDFS. Rather, indexes will be created
offline for a deployment (i.e., merging things to create an index per
search node), then copied out of NDFS to the local filesystem on a
production search
Great - I just thought that it would be better if you look at it -
instead of me digging into the code. I wanted to be on the safe side
with 0.7.1 release.
Regards
Piotr
Jérôme Charron wrote:
I am a bit lost but just a quick check - shouldn't it also be committed
in Release-0.7 branch?
No,
Does Nutch have a way to parse pdf files, that is, application/pdf
content type files?
I noticed a plugin variable setting in default.properties:
plugin.pdf=org.apache.nutch.parse.pdf*
I never changed this file.
Is that the right value?
I am using Nutch 0.7.
What do I have to do to make parse
[
http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_12320717 ]
Jerome Charron commented on NUTCH-21:
-
Want to commit it, but unit tests failed.
parser plugin for MS PowerPoint slides
--
Michael,
the solution is perhaps to use Jakarta Commons DateUtils.parseDate method:
http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[])
It will give something like:
Date parsedDate =
Currently we have three versions of nutch: trunk, 0.7 and mapred. This
increases the chances for conflicts. I would thus like to merge the
mapred branch into trunk soon. The soonest I could actually start this
is next week. Are there any objections?
I, too, am looking forward to this,
[EMAIL PROTECTED] wrote:
I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works. If
merging mapred to trunk means losing Kelvin's changes, then I
--- Doug Cutting [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works. If
merging
On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
I, too, am looking forward to this, but I am wondering what that
will do to Kelvin Tan's recent contribution, especially since I
saw that both MapReduce and Kelvin's code change how
FetchListEntry works. If
Hi Jérôme,
it works great (see the new function below). But we'll have to add
commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries.
Are there any objections? What is the procedure for adding it?
I'm trying my changes right now (I think, it will take the rest of the
night to