Hello Nutch,
perhaps you are interested in using my language detection training
data harvester to support more languages with your current
implementation. It downloads the Wikipedia article of the home
country for each language to be trained, in all languages that should
be trained.
As
With NUTCH-233 the issue is independent of Hadoop and lies with the
regex-urlfilter. The last solution posted in JIRA gives you more room to work
with: it allowed me to fetch a segment of over 1-2 million, but I ran into the
same issue when the segment approached 10 million in size.
Unless you
> Dennis Kubes wrote:
>> I was looking through the JIRA to try and help create a list for this
>> release and to say the least it is a little overwhelming. It looks
>> like there are 183 issues total with 152 being unassigned. What has
>> been the current process for testing/committing issues th
Great, thanks a lot.
I have started a complete Nutch cycle (generate, fetch, updatedb, invertlinks,
index and dedup) on a 13 million document segment, and this should take no
longer than a couple of days. I will let you know of any problems, but hopefully
it will work out with no errors at all.
[
https://issues.apache.org/jira/browse/NUTCH-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrzej Bialecki closed NUTCH-167.
---
Resolution: Fixed
Fix Version/s: 0.9.0
Assignee: Andrzej Bialecki
Patch appli
Andrzej
This feature is not critical; that was a mistake on my part. After several
more tests, we have found that this version is not yet stable enough. We are
working on a stable version, which should be uploaded as soon as it is done.
Armel
-----Original Message-----
From: Andrzej B
[
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrzej Bialecki updated NUTCH-427:
Priority: Major (was: Critical)
New features are not critical. This plugin uses an LGPL lib
Dennis Kubes wrote:
I was looking through the JIRA to try and help create a list for this
release and to say the least it is a little overwhelming. It looks
like there are 183 issues total with 152 being unassigned. What has
been the current process for testing/committing issues that have
p
I have gotten this working. A little bit of tweaking was involved but
everything works fine now.
Steve
-----Original Message-----
From: Steve Severance [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 07, 2007 2:19 PM
To: nutch-dev@lucene.apache.org
Subject: RE: 0.9 release
Also one thing that c
[
https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrzej Bialecki closed NUTCH-437.
---
Resolution: Fixed
Fix Version/s: (was: 0.8.2)
Fixed in rev. 515791 as part of the H
I was looking through the JIRA to try and help create a list for this
release and to say the least it is a little overwhelming. It looks like
there are 183 issues total with 152 being unassigned. What has been the
current process for testing/committing issues that have patches
attached?
Ch
Sean Dean wrote:
As it stands now with what's in trunk under 0.9-dev, one of the biggest problems is the
version of Hadoop we have included. It fails on anything above 200k URLs, and should be
considered a "blocker" issue.
It's my understanding that Andrzej has a newer Hadoop JAR with some cust
[
https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478920
]
Steve Severance commented on NUTCH-296:
---
I know the committers are hard at work on the 0.9.0 release but I have b
As it stands now with what's in trunk under 0.9-dev, one of the biggest problems
is the version of Hadoop we have included. It fails on anything above 200k
URLs, and should be considered a "blocker" issue.
It's my understanding that Andrzej has a newer Hadoop JAR with some custom
patches applied
Also, one thing that comes to mind, as I have been struggling with it:
there is no upgrade path that I know of from 0.8.x to 0.9.0. I followed the
directions in the wiki and that did not work. I later found in a mailing
list post that everything needs to be regenerated. There needs to be some
guid
> 2. Any outstanding things that need to get done that aren't really code that
> needs to get committed, e.g., things we need to close the loop on
One thing that comes to mind is the web site: we have tutorials
specifically for 0.7.x and 0.8.x, and it might be confusing for users if we left
it as is
[
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854
]
Doug Cutting commented on NUTCH-455:
Alternately, we could define it as an error to attempt to dedup by a tokenize
[
https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrzej Bialecki closed NUTCH-432.
---
Resolution: Fixed
Assignee: Andrzej Bialecki
Applied the patch suggested in HADOOP-1081
Hi,
[Cross-posting to announce the Tika proposal, please use
general@incubator.apache.org for followup discussion.]
This is a proposal to start a content analysis toolkit project in the
Apache Incubator. The live version of the proposal is available at
http://wiki.apache.org/incubator/TikaPropos
Hi Folks,
As suggested by Sami, I'm moving this discussion to the nutch-dev list.
Seems like I am the guy that is going to do the Nutch 0.9 release :-)
However, it seems also that there are some issues that need to be sorted out
first. I'd like to follow up to Andrzej's email about loose ends be
Hiya All,
I'm trying to set up Nutch (well Nutchwax actually but that's another
story) and I run into the following problem:
2007-03-07 14:52:15,287 INFO org.apache.hadoop.fs.DFSClient: Could not
obtain block from any node: java.io.IOException: No live nodes contain
current block
This
Nathan,
Sorry I didn't get back to you sooner. There are a few messy things that we
need to clear up in this plugin, as previously commented by Sami Siren. As for
the jdom, we need to change the plugin configuration so that it points to the
existing jdom library. Glad you got it to work thou
[
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enis Soztutar updated NUTCH-455:
Attachment: IndexSearcherCacheWarm.patch
The patch to the IndexSearcher is attached.
> dedup on toke
dedup on tokenized fields is faulty
---
Key: NUTCH-455
URL: https://issues.apache.org/jira/browse/NUTCH-455
Project: Nutch
Issue Type: Bug
Components: searcher
Affects Versions: 0.9.0