Re: null lang bug? and patch?

2005-08-31 Thread Jérôme Charron
I did a little digging and it appears that lang ends up being null (couldn't quite track down where lang should have been set). Not sure if it is a proper fix, but changing doc.getField(lang).stringValue() to doc.get(lang) makes my little crawl complete. lang is null because you don't have
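
A minimal sketch of why the one-line change described above avoids the crash, assuming the failure comes from a document that has no language field at all. SketchField, SketchDocument and NullLangDemo are hypothetical stand-ins written for this illustration, not the Lucene or Nutch classes; they only mimic the behaviour that getField(name) returns null for a missing field while get(name) returns a null string instead of throwing.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an index field (not the Lucene class). */
final class SketchField {
    private final String value;

    SketchField(String value) {
        this.value = value;
    }

    String stringValue() {
        return value;
    }
}

/** Hypothetical stand-in for an indexed document (not the Lucene class). */
final class SketchDocument {
    private final Map<String, SketchField> fields = new HashMap<>();

    void add(String name, String value) {
        fields.put(name, new SketchField(value));
    }

    /** Returns the named field, or null when the document has no such field. */
    SketchField getField(String name) {
        return fields.get(name);
    }

    /** Returns the field's string value, or null when the field is missing. */
    String get(String name) {
        SketchField field = fields.get(name);
        return field == null ? null : field.stringValue();
    }
}

final class NullLangDemo {
    public static void main(String[] args) {
        // Assumption: the language identifier never set a "lang" field.
        SketchDocument doc = new SketchDocument();

        // Null-safe accessor: yields null, so the caller can skip or default.
        String lang = doc.get("lang");
        System.out.println("get(\"lang\") returned: " + lang);

        // Chained accessor: dereferences the missing field and throws.
        try {
            doc.getField("lang").stringValue();
        } catch (NullPointerException e) {
            System.out.println("getField(\"lang\").stringValue() threw NullPointerException");
        }
    }
}
```

The caller still has to decide what a null language means; the sketch only shows that the second form moves the null from an exception to a return value.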

Re: Language identifier plugin questions

2005-08-31 Thread Jérôme Charron
I agree it is important to have the NGramProfile.getSimilarity() method. However, I think it is also important that it is consistent with the scoring that LanguageIdentifier uses, even if LanguageIdentifier optimises the implementation. Looking at the code I see that the two scoring

Re: Language identifier plugin questions

2005-08-31 Thread Jérôme Charron
Tom, I have created the NUTCH-86 issue to report the needed changes in the LanguageIdentifier we discussed in this thread. The issue is available at http://issues.apache.org/jira/browse/NUTCH-86 Regards Jérôme

Re[2]: NDFS question

2005-08-31 Thread Egor Chernodarov
Hello, Doug! I tried with the mapred branch, but I still get errors like this: $./nutch ndfs -put ./test.txt /test.txt = 050831 055936 Client connection to 192.168.0.170:9000: starting 050831 060245 Waiting to find target node = On the namenode I see: 050831 055936

Re: [Nutch-cvs] svn commit: r240359 - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/indexer/ plugin/nutch-extensionpoints/

2005-08-31 Thread Jérôme Charron
I see several instances of 'analySer' in comments/javadoc and some variables. That should probably be changed to the American spelling, 'analyzer', for consistency's sake. Corrected/Committed (http://svn.apache.org/viewcvs.cgi?rev=265020&view=rev) Regards Jérôme -- http://motrech.free.fr/

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-08-31 Thread Michael Nebel
Some more errors (short selection from my logfile). Do we really have to handle them all separately, or are there any functions/tools for this kind of problem? ...can't parse erroneous date: 12.06.2005 22:02:54 GMT ...can't parse erroneous date: 14.07.2005 GMT ...can't parse erroneous date:

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Doug Cutting
Apache Wiki wrote: 1. The SVN repository consists of the following areas: a. '''trunk''' [ ... ] a. '''Release-x.x''' branches [ ... ] This should also mention tags, fixed versions of the code where no development occurs. I also would prefer that tag names and branch names are distinct,

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Piotr Kosiorowski
Doug Cutting wrote: Glancing at other Apache projects in subversion, I see that httpd uses branch names like 2.2.x and tag names like 2.2.4. That's a little cryptic. I propose that we use branch names like branch-2.4 and tag names like release-2.4.1. What do folks think? +1 In fact I

Re: merge mapred to trunk

2005-08-31 Thread Piotr Kosiorowski
Doug Cutting wrote: Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances for conflicts. I would thus like to merge the mapred branch into trunk soon. The soonest I could actually start this is next week. Are there any objections? Doug +1 P.

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Jérôme Charron
Glancing at other Apache projects in subversion, I see that httpd uses branch names like 2.2.x and tag names like 2.2.4. That's a little cryptic. I propose that we use branch names like branch-2.4 and tag names like release-2.4.1. What do folks think? +1 Jérôme -- http://motrech.free.fr/

Re: null lang bug? and patch?

2005-08-31 Thread Jérôme Charron
I am a bit lost but just a quick check - shouldn't it also be committed in Release-0.7 branch? No, the analyzer extension-point is committed only in trunk. It's a new feature, so I follow Committer's Rules ( http://wiki.apache.org/nutch/Committer's_Rules) ;-) Regards Jérôme --

Re: Automating workflow using ndfs

2005-08-31 Thread Doug Cutting
I assume that in most NDFS-based configurations the production search system will not run out of NDFS. Rather, indexes will be created offline for a deployment (i.e., merging things to create an index per search node), then copied out of NDFS to the local filesystem on a production search

Re: null lang bug? and patch?

2005-08-31 Thread Piotr Kosiorowski
Great - I just thought that it would be better if you look at it - instead of me digging into the code. I wanted to be on the safe side with 0.7.1 release. Regards Piotr Jérôme Charron wrote: I am a bit lost but just a quick check - shouldn't it also be committed in Release-0.7 branch? No,

Fw: PDF support? Does crawl parse pdf files? How do I get it work?

2005-08-31 Thread Diane Palla
Does Nutch have a way to parse pdf files, that is, application/pdf content type files? I noticed a plugin variable setting in default.properties: plugin.pdf=org.apache.nutch.parse.pdf* I never changed this file. Is that the right value? I am using Nutch 0.7. What do I have to do to make parse

[jira] Commented: (NUTCH-21) parser plugin for MS PowerPoint slides

2005-08-31 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_12320717 ] Jerome Charron commented on NUTCH-21: Want to commit it, but unit tests failed. parser plugin for MS PowerPoint slides

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-08-31 Thread Jérôme Charron
Michael, the solution is perhaps to use Jakarta Commons DateUtils.parseDate method: http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[]) It will give something like: Date parsedDate =
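
A rough sketch of the try-several-patterns idea behind that suggestion, written against the JDK only so that it runs without commons-lang. This is not the code from the message (which is cut off in this archive) and not the index-more plugin's code. MultiPatternDateParser, tryParse and the PATTERNS array are hypothetical names, and the two patterns are assumptions chosen only to match two of the sample strings from the NUTCH-65 log excerpt.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Nutch or Commons Lang. */
final class MultiPatternDateParser {

    // Assumed patterns, picked to match two date strings from the log excerpt.
    static final String[] PATTERNS = {
        "dd.MM.yyyy HH:mm:ss z",
        "dd.MM.yyyy z",
    };

    private MultiPatternDateParser() {
    }

    /**
     * Tries each pattern in order and returns the first date that matches
     * the whole input, or an empty Optional when no pattern fits.
     */
    static Optional<Date> tryParse(String text, String... patterns) {
        for (String pattern : patterns) {
            // A new formatter per call, since SimpleDateFormat is not thread-safe.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setLenient(false); // reject out-of-range values such as month 13
            ParsePosition position = new ParsePosition(0);
            Date date = format.parse(text, position);
            // Accept only a match that consumed the entire string.
            if (date != null && position.getIndex() == text.length()) {
                return Optional.of(date);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {"12.06.2005 22:02:54 GMT", "14.07.2005 GMT", "not a date"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + tryParse(sample, PATTERNS));
        }
    }
}
```

Each new date format seen in the logs then becomes one more entry in the pattern list rather than a separate code path. Order matters: the more specific pattern goes first.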

Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances for conflicts. I would thus like to merge the mapred branch into trunk soon. The soonest I could actually start this is next week. Are there any objections? I, too, am looking forward to this,

Re: merge mapred to trunk

2005-08-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If merging mapred to trunk means losing Kelvin's changes, then I

Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
--- Doug Cutting [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If merging

Re: merge mapred to trunk

2005-08-31 Thread Kelvin Tan
On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote: [EMAIL PROTECTED] wrote: I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-08-31 Thread Michael Nebel
Hi Jérôme, it works great (see the new function below). But we'll have to add commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries. Are there any objections? What is the procedure for adding it? I'm trying my changes right now (I think it will take the rest of the night to