Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Mattmann, Chris A (388J)
Hi Phil, Thanks for your comments. Mine below:
>> Unfortunately some parts of the documentation on Nutch (namely the tutorial, and other parts of the static site) have been out of date for a while. This has occurred really independent of the releases, and independent of the wiki [1

Re: nutch crawl issue

2010-04-30 Thread Phil Barnett
This sounds exactly like what I have been experiencing.
On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
>
> I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I verified this with b

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
Oh yeah, I built a presentation and gave it to our local Linux User Group meeting. You might find it useful: http://leap-cf.org/presentations/nutch/NutchWebCrawler.odp
On Sat, May 1, 2010 at 2:10 AM, Phil Barnett wrote:
> On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius wrote:
>> I

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius wrote:
> I also share many of Phil's sentiments. I really want the project (bin/nutch crawl) to work for me as well and I want to help somehow. I would like to share a 5 GB 'intranet' web site with ~50 people. And I have not graduated to ma

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
On Wed, Apr 28, 2010 at 11:01 AM, Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov> wrote:
> Unfortunately some parts of the documentation on Nutch (namely the tutorial, and other parts of the static site) have been out of date for a while. This has occurred really independent of t

Re: JobTracker gets stuck with DFS problems

2010-04-30 Thread Andrzej Bialecki
On 2010-04-30 20:09, Emmanuel de Castro Santana wrote:
> Hi All
>
> We are using Nutch to crawl ~500K pages with a 3-node cluster; each node features a dual-core processor with 4 GB RAM and circa 100 GB storage. All nodes run on CentOS.
>
> These 500K pages are scattered into several si

JobTracker gets stuck with DFS problems

2010-04-30 Thread Emmanuel de Castro Santana
Hi All
We are using Nutch to crawl ~500K pages with a 3-node cluster; each node features a dual-core processor with 4 GB RAM and circa 100 GB storage. All nodes run on CentOS.
These 500K pages are scattered into several sites, each having between 5k and 200k pages. For each site