on't have more specifics, but I'm at home, the setup is at work
and I had to revert to get things back running. But I built a dev
machine so I can play with 1.1 and get more specific.
Phil Barnett
Senior Analyst
Walt Disney World.
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
> On 2010-04-10 17:49, Phil Barnett wrote:
> > On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote:
> >> Hi there,
> >>
> >> Well as soon as we have 3 +1 binding VOTEs. Right now I'm t
On Sat, Apr 10, 2010 at 11:04 PM, Phil Barnett wrote:
> On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
> > On 2010-04-10 17:49, Phil Barnett wrote:
> > > On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote:
> > >> Hi there,
> > >
On Thu, 2010-04-15 at 15:34 -0400, matthew a. grisius wrote:
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.crawl.Crawl.main(Cr
The fix.
In line 131 of Crawl.java, the generate call no longer returns segments like it used to. It now returns segs.
Line 131 needs to read
if (segs == null)
instead of the current
if (segments == null)
After that change and a recompile, crawl is working just fine.
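For anyone who wants to see why testing the wrong variable ends in a NullPointerException, here is a minimal, self-contained sketch of the pattern. It is not the Nutch source: the class CrawlLoopSketch, the generate() and crawl() methods, and the segment names are all hypothetical stand-ins. The only assumption carried over from the report above is that the generate step hands back null when no records are selected.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a crawl loop. None of these names are Nutch APIs.
class CrawlLoopSketch {

    // Hypothetical generator: returns the new segment names for this round,
    // or null when no records were selected for fetching.
    static String[] generate(int round, int selected) {
        if (selected == 0) {
            return null;
        }
        String[] segs = new String[selected];
        for (int i = 0; i < selected; i++) {
            segs[i] = "segment-" + round + "-" + i;
        }
        return segs;
    }

    // Runs one generate step per array entry and stops at the first empty
    // round. Returns every segment that would have been fetched.
    static List<String> crawl(int[] selectedPerRound) {
        List<String> fetched = new ArrayList<>();
        for (int round = 0; round < selectedPerRound.length; round++) {
            String[] segs = generate(round, selectedPerRound[round]);
            // Test the value generate() just returned, not an older variable.
            // Checking a different variable lets a null reach the loop below.
            if (segs == null) {
                break;
            }
            for (String seg : segs) {
                fetched.add(seg);
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // The third round selects nothing, so the loop stops before the fourth.
        System.out.println(crawl(new int[] {2, 1, 0, 5}));
    }
}
```

If the null check looked at some other variable, the for loop over segs would throw as soon as a round selected nothing, which matches the "0 records selected" line followed by the exception in the quoted log.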
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
> More details on this (your environment, OS, JDK version) and
> logs/stacktraces would be highly appreciated! You mentioned that you
> have some scripts - if you could extract relevant portions from them (or
> copy the scripts) it would h
Is there some place to tell why the crawler has rejected a page? I'm trying
to get 1.1 working and basically it doesn't seem to crawl the same way that
1.0 does.
I have tika included in the parse- section of conf/nutch-site.xml
I have DEBUG set for all the crawl sections, but it doesn't really sa
On Tue, Apr 20, 2010 at 7:02 PM, wrote:
> Hi Phil,
>
> > -----Original Message-----
> > From: Phil Barnett [mailto:ph...@philb.us]
> > Sent: Wednesday, 21 April 2010 8:39 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Question about crawler.
> >
I meant the production 1.0 server is still crawling them.
What's the difference between regex-urlfilters.xml and crawl-urlfilters.xml?
What uses what?
I'm having a problem where shouldfetch is rejecting everything. I have
deleted the crawl directory and started the entire crawl from scratch by
rm -rf crawl
mkdir crawl
mkdir segments
I'm absolutely baffled by how this scheduler works.
Is there documentation?
Is the fetchtime saved somewhere ot
I should add that what I really want to do is toss all previous crawl
information and reindex everything every night. It's just a few servers and
very low impact. My crawl on 1.0 takes about 10 minutes.
On Thu, Apr 22, 2010 at 4:59 AM, Phil Barnett wrote:
> I'm having a
ing a service to newcomers to bring
documentation in line with current offerings. This is not trivial code and
it takes a long time for someone from the outside to understand it. That
process is being stifled on multiple fronts as far as I can see. Either that
or I have missed an important document that exists and I haven't read it.
Phil Barnett
Senior Programmer / Analyst
Walt Disney World, Inc.
On Wed, Apr 28, 2010 at 11:01 AM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Unfortunately some parts of the documentation on Nutch (namely the
> tutorial,
> and other parts of the static site) have been out of date for a while. This
> has occurred really independent of t
On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius
wrote:
> I also share many of Phil's sentiments. I really want the project
> (bin/nutch crawl) to work for me as well and I want to help somehow. I
> would like to share a 5gb 'intranet' web site with ~50 people. And I
> have not graduated to ma
Oh yeah, I built a presentation and gave it to our local Linux User Group
meeting. You might find it useful:
http://leap-cf.org/presentations/nutch/NutchWebCrawler.odp
On Sat, May 1, 2010 at 2:10 AM, Phil Barnett wrote:
>
>
> On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius
This sounds exactly like what I have been experiencing.
On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius
wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
>
> I am trying to bin/nutch crawl a single html file generated by javadoc
> and no links are followed. I verified this with b
On Sat, May 1, 2010 at 2:34 AM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Sure, hopefully you'll find the answer you're looking for. In the
> meanwhile,
> it's my job to keep cutting release candidates as the RM, that at least
> pass
> the basic criteria for release and