Re: About Apache Nutch 1.1 Final Release

2010-04-10 Thread Phil Barnett
at home, the setup is at work and I had to revert to get things back running. But I built a dev machine so I can play with 1.1 and get more specific. Phil Barnett Senior Analyst Walt Disney World.

Re: About Apache Nutch 1.1 Final Release

2010-04-10 Thread Phil Barnett
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: On 2010-04-10 17:49, Phil Barnett wrote: On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote: Hi there, Well as soon as we have 3 +1 binding VOTEs. Right now I'm the only PMC member that's VOTE'd +1 on the release

Re: About Apache Nutch 1.1 Final Release

2010-04-14 Thread Phil Barnett
On Sat, Apr 10, 2010 at 11:04 PM, Phil Barnett ph...@philb.us wrote: On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: On 2010-04-10 17:49, Phil Barnett wrote: On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote: Hi there, Well as soon as we have 3 +1 binding

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Phil Barnett
On Thu, 2010-04-15 at 15:34 -0400, matthew a. grisius wrote: Generator: jobtracker is 'local', generating exactly one partition. Generator: 0 records selected for fetching, exiting ... Exception in thread "main" java.lang.NullPointerException at

Re: nutch 1.1 crawl d/n complete issue

2010-04-16 Thread Phil Barnett
The fix: on line 131 of Crawl.java, Generate no longer returns segments like it used to. Now it returns segs. Line 131 needs to read if (segs == null) instead of the current if (segments == null). After that change and a recompile, crawl is working just fine.
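
The null check described above can be illustrated with a small, self-contained sketch. This is not the actual Crawl.java source: the class CrawlLoopSketch, the generate and crawl methods, and the "crawl/segments" path are hypothetical stand-ins, and the sketch assumes that the generate step returns null when it selects nothing to fetch and that segments names a parent directory that is never null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrawlLoopSketch {

    /** Hypothetical generate step: returns null when nothing is left to fetch. */
    public static String[] generate(List<String> pending, int round) {
        if (pending.isEmpty()) {
            return null; // corresponds to "0 records selected for fetching, exiting ..."
        }
        pending.remove(0);
        return new String[] { "segment-" + round };
    }

    /** Runs up to depth rounds and returns the segment paths that were produced. */
    public static List<String> crawl(List<String> pending, int depth) {
        String segments = "crawl/segments"; // assumed parent directory, never null
        List<String> fetched = new ArrayList<>();
        for (int i = 0; i < depth; i++) {
            String[] segs = generate(pending, i);
            // Checking `segments` here would never be true, so the loop would
            // go on to read segs[0] and throw a NullPointerException.
            if (segs == null) {
                break;
            }
            fetched.add(segments + "/" + segs[0]);
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<String> pending = new ArrayList<>(Arrays.asList("http://example.com/"));
        System.out.println(crawl(pending, 3));
    }
}
```

With one pending URL and a depth of 3, the loop in this sketch produces one segment and then stops cleanly on the second round instead of failing.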

Re: About Apache Nutch 1.1 Final Release

2010-04-16 Thread Phil Barnett
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: More details on this (your environment, OS, JDK version) and logs/stacktraces would be highly appreciated! You mentioned that you have some scripts - if you could extract relevant portions from them (or copy the scripts) it would help

Question about crawler.

2010-04-20 Thread Phil Barnett
Is there some place to tell why the crawler has rejected a page? I'm trying to get 1.1 working and basically it doesn't seem to crawl the same way that 1.0 does. I have tika included in the parse- section of conf/nutch-site.xml. I have DEBUG set for all the crawl sections, but it doesn't really

Re: Question about crawler.

2010-04-20 Thread Phil Barnett
On Tue, Apr 20, 2010 at 7:02 PM, arkadi.kosmy...@csiro.au wrote: Hi Phil, -Original Message- From: Phil Barnett [mailto:ph...@philb.us] Sent: Wednesday, 21 April 2010 8:39 AM To: nutch-user@lucene.apache.org Subject: Question about crawler. Is there some place to tell why

Re: Question about crawler.

2010-04-20 Thread Phil Barnett
I meant the production 1.0 server is still crawling them.

conf questions

2010-04-20 Thread Phil Barnett
What's the difference between regex-urlfilters.xml and crawl-urlfilters.xml? What uses what?

Scheduler questions, 1.1 nightly build.

2010-04-22 Thread Phil Barnett
I'm having a problem where shouldfetch is rejecting everything. I have deleted the crawl directory and started the entire crawl from scratch by running rm -rf crawl, mkdir crawl, and mkdir segments. I'm absolutely baffled by how this scheduler works. Is there documentation? Is the fetchtime saved somewhere

Re: Scheduler questions, 1.1 nightly build.

2010-04-22 Thread Phil Barnett
I should add that what I really want to do is toss all previous crawl information and reindex everything every night. It's just a few servers and very low impact. My crawl on 1.0 takes about 10 minutes. On Thu, Apr 22, 2010 at 4:59 AM, Phil Barnett ph...@philb.us wrote: I'm having a problem

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Phil Barnett
a long time for someone from the outside to understand it. That process is being stifled on multiple fronts as far as I can see. Either that or I have missed an important document that exists and I haven't read it. Phil Barnett Senior Programmer / Analyst Walt Disney World, Inc.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Phil Barnett
On Wed, Apr 28, 2010 at 11:01 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Unfortunately some parts of the documentation on Nutch (namely the tutorial, and other parts of the static site) have been out of date for a while. This has occurred really independent of the

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius mgris...@comcast.net wrote: I also share many of Phil's sentiments. I really want the project (bin/nutch crawl) to work for me as well and I want to help somehow. I would like to share a 5gb 'intranet' web site with ~50 people. And I have

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
Oh yeah, I built a presentation and gave it to our local Linux User Group meeting. You might find it useful: http://leap-cf.org/presentations/nutch/NutchWebCrawler.odp On Sat, May 1, 2010 at 2:10 AM, Phil Barnett ph...@philb.us wrote: On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius

Re: nutch crawl issue

2010-05-01 Thread Phil Barnett
This sounds exactly like what I have been experiencing. On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius mgris...@comcast.net wrote: using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
On Sat, May 1, 2010 at 2:34 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Sure, hopefully you'll find the answer you're looking for. In the meanwhile, it's my job to keep cutting release candidates as the RM, that at least pass the basic criteria for release and right