language identification training data

2007-03-07 Thread karl wettin
Hello Nutch, perhaps you are interested in using my language detection training data harvester to support more languages with your current implementation. It downloads the Wikipedia article of the home country for each language to be trained, in all languages that should be trained. As
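The harvesting scheme described in the message above (one "home country" topic per language, each wanted in every language being trained) can be sketched roughly as follows. This is only an illustration of the pairing logic, not the actual harvester: the `HOME_COUNTRY` mapping, its entries, and the `harvest_plan` function are hypothetical names, and downloading the articles and resolving their localized titles are left out entirely.

```python
from itertools import product

# Hypothetical input: language code -> topic of that language's
# "home country" article. The entries are illustrative assumptions.
HOME_COUNTRY = {"sv": "Sweden", "de": "Germany", "fr": "France"}


def harvest_plan(home_country):
    """Return (article_language, topic) pairs.

    Every home-country topic is requested in every language to be
    trained, so n languages yield n * n pairs.
    """
    languages = sorted(home_country)
    topics = [home_country[lang] for lang in languages]
    return list(product(languages, topics))


if __name__ == "__main__":
    for article_language, topic in harvest_plan(HOME_COUNTRY):
        print(article_language, topic)
```

With n languages the plan grows quadratically, which is the point of the scheme: each language gets training text covering the same set of topics.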

Re: 0.9 release

2007-03-07 Thread Sean Dean
With NUTCH-233 the issue is independent of Hadoop and lies with the regex-urlfilter. The last solution posted in JIRA gives you more room to work with; it allowed me to fetch a segment of over 1-2 million, but I ran into the same issue when the segment approached 10 million in size. Unless you

Re: 0.9 release

2007-03-07 Thread Dennis Kubes
> Dennis Kubes wrote:
>> I was looking through the JIRA to try and help create a list for this
>> release and to say the least it is a little overwhelming. It looks
>> like there are 183 issues total with 152 being unassigned. What has
>> been the current process for testing/committing issues th

Re: 0.9 release

2007-03-07 Thread Sean Dean
Great, thanks a lot. I have started a complete Nutch cycle (generate, fetch, updatedb, invertlinks, index and dedup) on a 13 million document segment, and this should take no longer than a couple of days. I will let you know of any problems, but hopefully it will work out with no errors at all.

[jira] Closed: (NUTCH-167) Observation of directive

2007-03-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-167.
Resolution: Fixed
Fix Version/s: 0.9.0
Assignee: Andrzej Bialecki
Patch appli

RE: [jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol impl

2007-03-07 Thread Armel T. Nene
Andrzej, this feature is not critical, and that was a mistake on my part. After several more tests, we have found that this version is not stable enough yet. We are working on a stable version that should be uploaded as soon as we have it done. Armel -Original Message- From: Andrzej B

[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment

2007-03-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-427:
Priority: Major (was: Critical)
New features are not critical. This plugin uses an LGPL lib

Re: 0.9 release

2007-03-07 Thread Andrzej Bialecki
Dennis Kubes wrote: I was looking through the JIRA to try and help create a list for this release and to say the least it is a little overwhelming. It looks like there are 183 issues total with 152 being unassigned. What has been the current process for testing/committing issues that have p

RE: 0.9 release

2007-03-07 Thread Steve Severance
I have gotten this working. A little bit of tweaking was involved but everything works fine now. Steve -Original Message- From: Steve Severance [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 07, 2007 2:19 PM To: nutch-dev@lucene.apache.org Subject: RE: 0.9 release Also one thing that c

[jira] Closed: (NUTCH-437) MapFile in Hadoop Trunk has changed, must update references

2007-03-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-437.
Resolution: Fixed
Fix Version/s: (was: 0.8.2)
Fixed in rev. 515791 as part of the H

Re: 0.9 release

2007-03-07 Thread Dennis Kubes
I was looking through the JIRA to try and help create a list for this release and to say the least it is a little overwhelming. It looks like there are 183 issues total with 152 being unassigned. What has been the current process for testing/committing issues that have patches attached? Ch

Re: 0.9 release

2007-03-07 Thread Andrzej Bialecki
Sean Dean wrote: As it stands now with what's in trunk under 0.9-dev, one of the biggest problems is the version of Hadoop we have included. It fails on anything above 200k URLs, and should be considered a "blocker" issue. It's my understanding that Andrzej has a newer Hadoop JAR with some cust

[jira] Commented: (NUTCH-296) Image Search

2007-03-07 Thread Steve Severance (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478920 ]
Steve Severance commented on NUTCH-296:
I know the committers are hard at work on the 0.9.0 release but I have b

Re: 0.9 release

2007-03-07 Thread Sean Dean
As it stands now with what's in trunk under 0.9-dev, one of the biggest problems is the version of Hadoop we have included. It fails on anything above 200k URLs, and should be considered a "blocker" issue. It's my understanding that Andrzej has a newer Hadoop JAR with some custom patches applied

RE: 0.9 release

2007-03-07 Thread Steve Severance
Also, one thing that comes to mind, as I have been struggling with it: there is no upgrade path that I know of from 0.8.x to 0.9.0. I followed the directions in the wiki and that did not work. I later found in a mailing list post that everything needs to be regenerated. There needs to be some guid

Re: 0.9 release

2007-03-07 Thread Sami Siren
> 2. Any outstanding things that need to get done that aren't really code that
> needs to get committed, e.g., things we need to close the loop on
One thing that comes to my mind is the web site: we have tutorials specifically for 0.7.x and 0.8.x, and it might be confusing for users if we left it as is

[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Doug Cutting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854 ]
Doug Cutting commented on NUTCH-455:
Alternately, we could define it as an error to attempt to dedup by a tokenize

[jira] Closed: (NUTCH-432) JAVA_PLATFORM with spaces (i.e. Mac OS X-ppc-32) breaks bin/nutch script

2007-03-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-432.
Resolution: Fixed
Assignee: Andrzej Bialecki
Applied the patch suggested in HADOOP-1081

[PROPOSAL] Tika, a content analysis toolkit

2007-03-07 Thread Jukka Zitting
Hi, [Cross-posting to announce the Tika proposal, please use general@incubator.apache.org for followup discussion.] This is a proposal to start a content analysis toolkit project in the Apache Incubator. The live version of the proposal is available at http://wiki.apache.org/incubator/TikaPropos

0.9 release

2007-03-07 Thread Chris Mattmann
Hi Folks, As suggested by Sami, I'm moving this discussion to the nutch-dev list. Seems like I am the guy that is going to do the Nutch 0.9 release :-) However, it seems also that there are some issues that need to be sorted out first. I'd like to follow up to Andrzej's email about loose ends be

No live nodes contain current block

2007-03-07 Thread Pope, Jackson
Hiya All, I'm trying to set up Nutch (well, Nutchwax actually, but that's another story) and I run into the following problem:
2007-03-07 14:52:15,287 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block from any node: java.io.IOException: No live nodes contain current block
This

RE: [jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-03-07 Thread Alan Tanaman
Nathan, Sorry I didn't get back to you sooner. There are a few messy things that we need to clear up in this plugin, as previously commented by Sami Siren. As for the jdom, we need to change the plugin configuration so that it points to the existing jdom library. Glad you got it to work thou

[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Enis Soztutar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated NUTCH-455:
Attachment: IndexSearcherCacheWarm.patch
the patch to the IndexSearcher is attached
> dedup on toke

[jira] Created: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Enis Soztutar (JIRA)
dedup on tokenized fields is faulty
Key: NUTCH-455
URL: https://issues.apache.org/jira/browse/NUTCH-455
Project: Nutch
Issue Type: Bug
Components: searcher
Affects Versions: 0.9.0