Re: nutch and mahout integration

2012-07-02 Thread Mathijs Homminga
We wrote a custom Nutch parse plugin that uses a Mahout classifier to classify docs. Mathijs Homminga On Jul 1, 2012, at 21:02, Alexander Aristov alexander.aris...@gmail.com wrote: People, can you give me some advice? I want to integrate nutch and mahout to classify crawled pages

Re: GSoC2012 Idea: Integrating Nutch With Hama

2012-03-24 Thread Mathijs Homminga
This is interesting, can you elaborate a bit more on this? In what way do you think Nutch could benefit from an implementation in Hama? Mathijs Homminga On 24 Mar 2012, at 13:55, Apurv Verma wrote: Hi, Would the Nutch community be interested in integrating Nutch and Hama? Apache Hama

[jira] [Issue Comment Edited] (NUTCH-882) Design a Host table in GORA

2012-03-12 Thread Mathijs Homminga (Issue Comment Edited) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859 ] Mathijs Homminga edited comment on NUTCH-882 at 3/12/12 8:29 PM

[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-03-12 Thread Mathijs Homminga (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859 ] Mathijs Homminga commented on NUTCH-882: Hi guys, I have second thoughts

[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-03-08 Thread Mathijs Homminga (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225486#comment-13225486 ] Mathijs Homminga commented on NUTCH-882: Status: I have updated the patches

[jira] [Updated] (NUTCH-1290) crawlId not supported by all Tools

2012-03-06 Thread Mathijs Homminga (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathijs Homminga updated NUTCH-1290: Attachment: NUTCH-1290.patch This patch modifies the following files in order to support

[jira] [Created] (NUTCH-1290) crawlId not supported by all Tools

2012-02-28 Thread Mathijs Homminga (Created) (JIRA)
Reporter: Mathijs Homminga Priority: Minor Fix For: nutchgora See also: https://issues.apache.org/jira/browse/NUTCH-907 The StorageUtils class exposes a createDataStore method which uses the default schema for a persistent class specified in the Gora configuration

Re: [nutchgora] AbstractFetchSchedule.forceFetch method resets fetch status

2012-02-28 Thread Mathijs Homminga
: https://issues.apache.org/jira/browse/NUTCH-578 https://issues.apache.org/jira/browse/NUTCH-1245 Is your issue similar to these? On Tuesday 28 February 2012 14:09:25 Mathijs Homminga wrote: Hi, Does anyone know why the AbstractFetchSchedule.forceFetch method sets the page.status

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

2012-02-27 Thread Mathijs Homminga (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217172#comment-13217172 ] Mathijs Homminga commented on NUTCH-1289: Nice catch. The PartitionUrlByHost

Re: Nutch ignores robots.txt

2011-11-16 Thread Mathijs Homminga
Hi Lewis, I believe that you can find the robots.txt of the site here: http://www.kinoundco.de/robots.txt I think he followed the instructions at http://lucene.apache.org/nutch/bot.html (this outdated URL is still in the HttpBase.java btw) correctly. My guess is that the guys at pixray.com have

Re: Nutch ignores robots.txt

2011-11-03 Thread Mathijs Homminga
Hello Max, (Besides the fact that this client seems to have a broken random URL generator) Crawlers (like Nutch clients) may not always obey robot rules. If Nutch is not configured properly, it will not recognize your Nutch entry in your robots.txt file. If the requests come from a

[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2011-10-30 Thread Mathijs Homminga (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139610#comment-13139610 ] Mathijs Homminga commented on NUTCH-882: Julien, did you make a start with I'll