Re: About Apache Nutch 1.1 Final Release

2010-04-10 Thread Phil Barnett
on't have more specifics, but I'm at home, the setup is at work and I had to revert to get things back running. But I built a dev machine so I can play with 1.1 and get more specific. Phil Barnett Senior Analyst Walt Disney World.

Re: About Apache Nutch 1.1 Final Release

2010-04-10 Thread Phil Barnett
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: > On 2010-04-10 17:49, Phil Barnett wrote: > > On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote: > >> Hi there, > >> > >> Well as soon as we have 3 +1 binding VOTEs. Right now I'm t

Re: About Apache Nutch 1.1 Final Release

2010-04-13 Thread Phil Barnett
On Sat, Apr 10, 2010 at 11:04 PM, Phil Barnett wrote: > On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: > > On 2010-04-10 17:49, Phil Barnett wrote: > > > On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote: > > >> Hi there, > > >

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Phil Barnett
On Thu, 2010-04-15 at 15:34 -0400, matthew a. grisius wrote: > Generator: jobtracker is 'local', generating exactly one partition. > Generator: 0 records selected for fetching, exiting ... > Exception in thread "main" java.lang.NullPointerException > at org.apache.nutch.crawl.Crawl.main(Cr

Re: nutch 1.1 crawl d/n complete issue

2010-04-16 Thread Phil Barnett
The fix: in line 131 of Crawl.java, Generate no longer returns segments like it used to. Now it returns segs. Line 131 needs to read

    if (segs == null)

instead of the current

    if (segments == null)

After that change and a recompile, crawl is working just fine.
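For readers following the thread, the change described above can be illustrated with a minimal, self-contained sketch. This is not the actual Crawl.java source: the class name CrawlLoopSketch, the generate and crawl helpers, and the roundsWithUrls parameter are all hypothetical stand-ins, and the only detail taken from the message is the switch from testing segments to testing segs. The assumption is that the generate step returns null when no records are selected, so the null check has to look at its return value rather than at the segments directory passed in.

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlLoopSketch {

    /**
     * Hypothetical stand-in for the generate step. Returns the newly
     * created segment paths, or null when nothing was selected for fetching.
     */
    static String[] generate(String segmentsDir, int round, int roundsWithUrls) {
        if (round >= roundsWithUrls) {
            return null; // "0 records selected for fetching"
        }
        return new String[] { segmentsDir + "/segment-" + round };
    }

    /** Runs up to depth rounds and returns the segments that were "fetched". */
    static List<String> crawl(String segmentsDir, int depth, int roundsWithUrls) {
        List<String> fetched = new ArrayList<>();
        for (int i = 0; i < depth; i++) {
            String[] segs = generate(segmentsDir, i, roundsWithUrls);
            // Check the value returned by generate (segs). Checking the
            // input directory (segmentsDir) instead would never stop the
            // loop, and segs[0] below would throw a NullPointerException.
            if (segs == null) {
                break;
            }
            fetched.add(segs[0]);
        }
        return fetched;
    }

    public static void main(String[] args) {
        // One round has URLs; the loop stops cleanly on the second round.
        System.out.println(crawl("crawl/segments", 3, 1));
    }
}
```

With the check on the wrong variable, the loop in this sketch would go on to dereference a null array, which would be consistent with the NullPointerException reported earlier in the thread.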

Re: About Apache Nutch 1.1 Final Release

2010-04-16 Thread Phil Barnett
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: > More details on this (your environment, OS, JDK version) and > logs/stacktraces would be highly appreciated! You mentioned that you > have some scripts - if you could extract relevant portions from them (or > copy the scripts) it would h

Question about crawler.

2010-04-20 Thread Phil Barnett
Is there some place to tell why the crawler has rejected a page? I'm trying to get 1.1 working and basically it doesn't seem to crawl the same way that 1.0 does. I have tika included in the parse- section of conf/nutch-site.xml I have DEBUG set for all the crawl sections, but it doesn't really sa

Re: Question about crawler.

2010-04-20 Thread Phil Barnett
On Tue, Apr 20, 2010 at 7:02 PM, wrote: > Hi Phil, > > > -Original Message- > > From: Phil Barnett [mailto:ph...@philb.us] > > Sent: Wednesday, 21 April 2010 8:39 AM > > To: nutch-user@lucene.apache.org > > Subject: Question about crawler. > >

Re: Question about crawler.

2010-04-20 Thread Phil Barnett
I meant the production 1.0 server is still crawling them.

conf questions

2010-04-20 Thread Phil Barnett
What's the difference between regex-urlfilters.xml and crawl-urlfilters.xml? What uses what?

Scheduler questions, 1.1 nightly build.

2010-04-22 Thread Phil Barnett
I'm having a problem where shouldfetch is rejecting everything. I have deleted the crawl directory and started the entire crawl from scratch by running

    rm -rf crawl
    mkdir crawl
    mkdir segments

I'm absolutely baffled by how this scheduler works. Is there documentation? Is the fetchtime saved somewhere ot

Re: Scheduler questions, 1.1 nightly build.

2010-04-22 Thread Phil Barnett
I should add that what I really want to do is toss all previous crawl information and reindex everything every night. It's just a few servers and very low impact. My crawl on 1.0 takes about 10 minutes. On Thu, Apr 22, 2010 at 4:59 AM, Phil Barnett wrote: > I'm having a

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Phil Barnett
ing a service to newcomers to bring documentation in line with current offerings. This is not trivial code and it takes a long time for someone from the outside to understand it. That process is being stifled on multiple fronts as far as I can see. Either that or I have missed an important document that exists and I haven't read it. Phil Barnett Senior Programmer / Analyst Walt Disney World, Inc.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
On Wed, Apr 28, 2010 at 11:01 AM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > > Unfortunately some parts of the documentation on Nutch (namely the > tutorial, > and other parts of the static site) have been out of date for a while. This > has occurred really independent of t

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius wrote: > I also share many of Phil's sentiments. I really want the project > (bin/nutch crawl) to work for me as well and I want to help somehow. I > would like to share a 5gb 'intranet' web site with ~50 people. And I > have not graduated to ma

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
Oh yeah, I built a presentation and gave it to our local Linux User Group meeting. You might find it useful: http://leap-cf.org/presentations/nutch/NutchWebCrawler.odp On Sat, May 1, 2010 at 2:10 AM, Phil Barnett wrote: > > > On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius

Re: nutch crawl issue

2010-04-30 Thread Phil Barnett
This sounds exactly like what I have been experiencing. On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius wrote: > using Nutch nightly build nutch-2010-04-27_04-00-28: > > I am trying to bin/nutch crawl a single html file generated by javadoc > and no links are followed. I verified this with b

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
On Sat, May 1, 2010 at 2:34 AM, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > > Sure, hopefully you'll find the answer you're looking for. In the > meanwhile, > it's my job to keep cutting release candidates as the RM, that at least > pass > the basic criteria for release and