)
@@ -174,6 +174,7 @@
} else {
currentJob.setNumReduceTasks(numTasks);
}
+currentJob.waitForCompletion(true);
ToolUtil.recordJobStatus(null, currentJob, results);
return results;
}
Alexis
/NUTCH-899 which
is the same problem. I tried to come up with a JUnit test but it is
still rather imperfect (I want to use
org.apache.nutch.util.CrawlTestUtil.getServer for it). The whole patch
is here:
https://issues.apache.org/jira/secure/attachment/12466548/httpContentLimit.patch
Alexis
of the test. It worked for me after I patched a
few things. They are described throughout the blog entry and in the new
JIRA-950 issue which, among other things, reopens JIRA-899.
Hope this helps.
Alexis.
participate
please refer to the Nutch 2.0 section in the wiki. There are many ways to
contribute: send a message to the mailing list, create an issue on
JIRA (with or without a patch attached), update the wiki...
Give it a shot!
Alexis
http://techvineyard.blogspot.com
On Tue, Feb 15, 2011 at
Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
During the Nutch build, you have to manually tweak the Ivy configuration
depending on your choice of Gora store, in this case Cassandra.
Basically you ne
his line to get the hector dependency:
>
> conf="*->default"/>
>
> -Original Message-
> From: Alexis [mailto:alexis.detregl...@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
Hi Tom,
I'm having the same issue.
The two jars missing from nutch-2.0-dev.job, cassandra-all-0.8.0.jar
and hector-core-0.8.0-1.jar, were manually uploaded into the
gora-cassandra/lib-ext SVN directory so that the Gora build works, because for
some reason I did not get them downloaded through Mav
order to implement search. They use HBase which is, by
the way, Nutch 2.0 compatible.
Take a look:
http://developer.yahoo.com/events/hadoopsummit2011/agenda.html#22 (sorry I
don't think any video of the summit is available yet, not sure why)
Alexis
On Mon, Sep 19, 2011 at 1:05 AM, Jul
Dear Ferdy,
This mapping is user defined. It specifies where Avro fields required
by Nutch jobs are stored in HBase.
You can tweak the schema according to these kinds of considerations by
editing the config file.
So content is populated by the Fetcher job (writes) that downloads the
web page. It i
[
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
]
Alexis commented on NUTCH-873:
--
It did not work as seamlessly for me. The gora build creat
[
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
]
Alexis edited comment on NUTCH-873 at 11/5/10 3:48 PM:
---
It did
[
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
]
Alexis edited comment on NUTCH-873 at 11/5/10 3:51 PM:
---
It did
[
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
]
Alexis edited comment on NUTCH-873 at 11/5/10 3:52 PM:
---
It did
[
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928896#action_12928896
]
Alexis commented on NUTCH-880:
--
This revision introduced a bug in the nutch inject command
[
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970336#action_12970336
]
Alexis commented on NUTCH-899:
--
I ran into the exact same issue, with MySQL. The blob co
[
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-899:
-
Attachment: httpContentLimit.patch
We stick with the default gora schema for the MySQL backend, which says
Reporter: Alexis
1. crawl command (nutch1.patch)
The class was renamed to Crawler but the references to it were not updated.
2. URL filter (nutch2.patch)
This avoids an NPE on bogus URLs whose host does not have a suffix.
3. Content-Length limit (nutch3.patch)
This is related to NUTCH-899
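
As an illustration of the second item above, here is a minimal sketch of a null-safe suffix check. It is not the code from nutch2.patch: the class name SuffixGuard, the tiny suffix table, and the choice to reject URLs without a known suffix are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: guards against hosts that have no known suffix. */
final class SuffixGuard {

    // Assumption: a tiny stand-in for a real domain-suffix table.
    private static final Set<String> KNOWN_SUFFIXES = Set.of("com", "org", "net");

    private SuffixGuard() {}

    /** Returns the known suffix of the host, or null when there is none. */
    static String findSuffix(String host) {
        if (host == null) {
            return null;
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0) {
            return null; // e.g. "localhost" has no suffix at all
        }
        String candidate = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_SUFFIXES.contains(candidate) ? candidate : null;
    }

    /** Accepts a URL only when its host has a known suffix; never dereferences a null suffix. */
    static boolean accept(String url) {
        try {
            String host = new URI(url).getHost(); // may be null for opaque or bogus URLs
            return findSuffix(host) != null;      // explicit null check instead of an NPE
        } catch (URISyntaxException e) {
            return false; // unparsable input is treated as rejected in this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.org/index.html"));
        System.out.println(accept("http://localhost/page"));
    }
}
```

The point is only that the suffix lookup can legitimately return null, so the caller has to test for that before using the result.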
[
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-950:
-
Attachment: nutch4.patch
> Content-Length limit, URL filter and few minor iss
[
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-950:
-
Attachment: nutch3.patch
nutch2.patch
nutch1.patch
> Content-Length limit,
Ivy configuration
-
Key: NUTCH-955
URL: https://issues.apache.org/jira/browse/NUTCH-955
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 2.0
Reporter: Alexis
As mentioned
[
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-955:
-
Attachment: ivy.patch
In the patch, the required dependencies for MySQL and HBase are included in the
Ivy config
[
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979525#action_12979525
]
Alexis edited comment on NUTCH-955 at 1/10/11 5:27 AM:
---
In the p
[
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis resolved NUTCH-950.
--
Resolution: Fixed
Fix Version/s: 2.0
Sorry I missed the Ivy configuration file in the plugin directory
soldindex issues
Key: NUTCH-956
URL: https://issues.apache.org/jira/browse/NUTCH-956
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.0
Reporter: Alexis
I ran into a few
[
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-956:
-
Attachment: solr.patch
Here are the changes:
- Avoid multiple values for id field. (NUTCH-819)
- Allow multiple
[
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-956:
-
Summary: solrindex issues (was: soldindex issues)
> solrindex issues
>
>
>
[
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983125#action_12983125
]
Alexis commented on NUTCH-955:
--
Sorry please disregard the nutch.root first bullet in
Parsing takes up 100% CPU
-
Key: NUTCH-965
URL: https://issues.apache.org/jira/browse/NUTCH-965
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: Alexis
The issue you
[
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-965:
-
Attachment: parserJob.patch
In the parser mapper, compare Content-Length header to the size of the content
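
The comparison described above could look roughly like the following sketch. It is not the code from parserJob.patch: the class and method names are hypothetical, and the header value and content bytes are passed in directly rather than read from Nutch's page object.

```java
/** Hypothetical helper, not part of Nutch: decides whether stored content looks truncated. */
final class TruncationCheck {

    private TruncationCheck() {}

    /**
     * Returns true when the Content-Length header announces more bytes than were stored.
     * A missing or unparsable header yields false, because nothing can be concluded from it.
     */
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null || content == null) {
            return false;
        }
        try {
            long announced = Long.parseLong(contentLengthHeader.trim());
            return announced > content.length;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] stored = new byte[1024];
        System.out.println(isTruncated("4096", stored)); // header announces more than was stored
        System.out.println(isTruncated("1024", stored)); // sizes agree
    }
}
```

A mapper could call such a check first and skip parsing when it returns true. Note that this sketch assumes the stored bytes are in the same encoding the header describes; a response that was decompressed after download would need a different comparison.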
[
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-965:
-
Summary: Skip parsing for truncated documents (was: Parsing takes up 100%
CPU)
> Skip parsing for trunca
[
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064148#comment-13064148
]
Alexis commented on NUTCH-956:
--
I do get the NPE when indexing this url
[
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-956:
-
Attachment: solr.patch2
- NPE related to content-type field
- tld field in Solr schema
- string comparison in
32 matches