Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Ferdy Galema
Findings about Nutch-2.0 RC 1. The Nutch job jar is not present in the binary archive. This means distributed running of jobs is not supported. I'm not sure if this is a problem (since users can always build one themselves), merely pointing it out. The recently released 1.5 also lacks this job

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Ferdy Galema
Hmm, please ignore the parse text limited to 100 chars, this is actually not the case. (Only in our branch that has a fix for limiting anchor texts; not yet present in the nutchgora branch because it still needs polishing). So no need to wait for commits on my part. On Wed, Jun 13, 2012 at

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Seb, As Chris said, the issues you highlight well justify another RC. I can shift it by the end of play today. Thanks very much for having a look through guys Lewis On Tue, Jun 12, 2012 at 11:33 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Lewis, my first steps with 2.0 (to

[Nutch Wiki] Trivial Update of GORA_HBase by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The GORA_HBase page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/GORA_HBase?action=diff&rev1=11&rev2=12 - This document describes how to get Nutch to use HBase as a

[Nutch Wiki] Trivial Update of GORA_HBase by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The GORA_HBase page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/GORA_HBase?action=diff&rev1=12&rev2=13 This document describes how to get Nutch 2.0 to use

[Nutch Wiki] Trivial Update of GORA_HBase by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The GORA_HBase page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/GORA_HBase?action=diff&rev1=13&rev2=14 <value>org.apache.gora.hbase.store.HBaseStore</value>

[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=240&rev2=241 * [[NutchMavenSupport|Using Nutch as a Maven dependency]]

[Nutch Wiki] Trivial Update of Nutch2Tutorial by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Nutch2Tutorial page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/Nutch2Tutorial New page: = Nutch 2.0 Tutorial =

[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=241&rev2=242 * Nutch2Roadmap -- Discussions on the architecture and

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Seb, Quick update On Tue, Jun 12, 2012 at 11:33 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: 1 some guidance would be nice. README.txt points to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an update

Suitable Nutch 2.0 Project Description

2012-06-13 Thread Lewis John Mcgibbney
Hi, Seeing as we have the ball rolling with the 2.0 RC, I thought I'd ask about a suitable project descriptor. So far on trunk we have ** Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a

Re: Suitable Nutch 2.0 Project Description

2012-06-13 Thread Ferdy Galema
Hi, I would remove the 'experimental' notion. Aside from that it's fine with me. Ferdy. On Wed, Jun 13, 2012 at 2:29 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Seeing as we have the ball rolling with the 2.0 RC, I thought I'd ask about a suitable project descriptor.

[jira] [Commented] (NUTCH-1342) Read time out protocol-http

2012-06-13 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294429#comment-13294429 ] Ferdy Galema commented on NUTCH-1342: - Do you have any clue as to why

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Julien Nioche
Ferdy The Nutch job jar is not present in the binary archive. This means distributed running of jobs is not supported. I'm not sure if this is a problem (since users can always build one themselves), merely pointing it out. The recently released 1.5 also lacks this job jar, so at least no

Re: Suitable Nutch 2.0 Project Description

2012-06-13 Thread Julien Nioche
"and and array other document" looks like a typo, rest is fine On 13 June 2012 13:45, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi, I would remove the 'experimental' notion. Aside from that it's fine with me. Ferdy. On Wed, Jun 13, 2012 at 2:29 PM, Lewis John Mcgibbney

[Nutch Wiki] Trivial Update of Nutch2Tutorial by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Nutch2Tutorial page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/Nutch2Tutorial?action=diff&rev1=2&rev2=3

[Nutch Wiki] Trivial Update of Nutch2Tutorial by LewisJohnMcgibbney

2012-06-13 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Nutch2Tutorial page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/Nutch2Tutorial?action=diff&rev1=3&rev2=4 This document describes how to get Nutch 2.0 to use

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Guys, Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't supply binary distributions of the code. This is because when using Gora a user may wish/require to recompile the code to accommodate config changes etc. We only supply src distributions... Does this principle

Re: Suitable Nutch 2.0 Project Description

2012-06-13 Thread Mattmann, Chris A (388J)
+1 to the description w/o experimental too (I agree with Ferdy). You guys ROCK. Cheers, Chris On Jun 13, 2012, at 5:29 AM, Lewis John Mcgibbney wrote: Hi, Seeing as we have the ball rolling with the 2.0 RC, I thought I'd ask about a suitable project descriptor. So far on trunk we have

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Sebastian Nagel
Hi Lewis, Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an update of Julien's (I think) page on GORA_HBase. This will get you rocking with HBase. The changes between Cassandra, Accumulo and the other data stores are fairly trivial. I managed to perform a crawl with 2.0

[jira] [Created] (NUTCH-1390) readdb -url $url throws NPE with gora-cassandra

2012-06-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1390: --- Summary: readdb -url $url throws NPE with gora-cassandra Key: NUTCH-1390 URL: https://issues.apache.org/jira/browse/NUTCH-1390 Project: Nutch

[jira] [Created] (NUTCH-1391) readdb -stats fires java.io.EOFException

2012-06-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1391: --- Summary: readdb -stats fires java.io.EOFException Key: NUTCH-1391 URL: https://issues.apache.org/jira/browse/NUTCH-1391 Project: Nutch Issue

[jira] [Created] (NUTCH-1392) -force and -resume arguments being ignored in ParserJob

2012-06-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1392: --- Summary: -force and -resume arguments being ignored in ParserJob Key: NUTCH-1392 URL: https://issues.apache.org/jira/browse/NUTCH-1392 Project: Nutch

[jira] [Created] (NUTCH-1393) Display consistent usage of GeneratorJob with 1.X

2012-06-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1393: --- Summary: Display consistent usage of GeneratorJob with 1.X Key: NUTCH-1393 URL: https://issues.apache.org/jira/browse/NUTCH-1393 Project: Nutch

[jira] [Created] (NUTCH-1394) backport NUTCH-1232 Remove host field from index-basic

2012-06-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1394: --- Summary: backport NUTCH-1232 Remove host field from index-basic Key: NUTCH-1394 URL: https://issues.apache.org/jira/browse/NUTCH-1394 Project: Nutch

Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Sebastian, On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: I managed to perform a crawl with 2.0 and HBase: it rocks, indeed. Much simpler than 1.x (no segments!). :0) % ./bin/nutch readdb -stats WebTable statistics start WebTableReader:

[jira] [Commented] (NUTCH-1392) -force and -resume arguments being ignored in ParserJob

2012-06-13 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294730#comment-13294730 ] Lewis John McGibbney commented on NUTCH-1392: - Additionally this issue should