We wrote a custom Nutch parse plugin that uses a Mahout classifier to classify
docs.
Mathijs Homminga
On Jul 1, 2012, at 21:02, Alexander Aristov alexander.aris...@gmail.com wrote:
People
can you give me some advice?
I want to integrate nutch and mahout to classify crawled pages
This is interesting. Can you elaborate a bit more on this? In what way do you
think Nutch could benefit from an implementation in Hama?
Mathijs Homminga
On 24 Mar 2012, at 13:55, Apurv Verma wrote:
Hi,
Would the Nutch community be interested in integrating Nutch and Hama?
Apache Hama
[
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859
]
Mathijs Homminga edited comment on NUTCH-882 at 3/12/12 8:29 PM
[
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859
]
Mathijs Homminga commented on NUTCH-882:
Hi guys,
I have second thoughts
[
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225486#comment-13225486
]
Mathijs Homminga commented on NUTCH-882:
Status:
I have updated the patches
[
https://issues.apache.org/jira/browse/NUTCH-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mathijs Homminga updated NUTCH-1290:
Attachment: NUTCH-1290.patch
This patch modifies the following files in order to support
Reporter: Mathijs Homminga
Priority: Minor
Fix For: nutchgora
See also: https://issues.apache.org/jira/browse/NUTCH-907
The StorageUtils class exposes a createDataStore method which uses the default
schema for a persistent class specified in the Gora configuration
:
https://issues.apache.org/jira/browse/NUTCH-578
https://issues.apache.org/jira/browse/NUTCH-1245
Is your issue similar to these?
On Tuesday 28 February 2012 14:09:25 Mathijs Homminga wrote:
Hi,
Does anyone know why the AbstractFetchSchedule.forceFetch method sets the
page.status
[
https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217172#comment-13217172
]
Mathijs Homminga commented on NUTCH-1289:
Nice catch. The PartitionUrlByHost
Hi Lewis,
I believe that you can find the robots.txt of the site here:
http://www.kinoundco.de/robots.txt
I think he followed the instructions at http://lucene.apache.org/nutch/bot.html
(this outdated URL is still in the HttpBase.java btw) correctly.
My guess is that the guys at pixray.com have
Hello Max,
(Besides the fact that this client seems to have a broken random URL
generator)
Crawlers (like Nutch clients) may not always obey robot rules. If Nutch is not
configured properly, it will not recognize your Nutch entry in your robots.txt
file.
If the requests come from a
[
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139610#comment-13139610
]
Mathijs Homminga commented on NUTCH-882:
Julien, did you make a start with I'll
12 matches