Hi folks,
Thanks for all of your suggestions. Here are two tentative fixes suggested
by my colleagues at work:
Fix 1:
Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer, change
line 129 to:
long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED) ?
System.currentTimeMillis() : page.getModifiedTime();
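
To make the intent of that one-liner concrete, here is a minimal standalone
sketch of the same decision. It does not depend on Nutch: the constants are
stand-ins for the FetchSchedule status values (their numeric values here are
an assumption), and pickModifiedTime is a hypothetical helper name, not
something that exists in Nutch.

```java
public class ModifiedTimeSketch {
    // Stand-ins for FetchSchedule.STATUS_MODIFIED / STATUS_NOTMODIFIED.
    // The numeric values are assumed for illustration only.
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    // Hypothetical helper mirroring the proposed line: if the page was
    // detected as modified, stamp it with "now"; otherwise keep the old value.
    static long pickModifiedTime(int modified, long previousModifiedTime, long now) {
        return (modified == STATUS_MODIFIED) ? now : previousModifiedTime;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Modified page: the stored time becomes the current time.
        System.out.println(pickModifiedTime(STATUS_MODIFIED, 0L, now) == now);
        // Unmodified page: the previously stored time is kept.
        System.out.println(pickModifiedTime(STATUS_NOTMODIFIED, 1234L, now));
    }
}
```

Note that with this change modifiedTime means "when Nutch noticed the change",
not "when the source says the content changed".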
Fix (or really workaround) 2:
Alter the webpage table in MySQL to contain a column
update_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP
Define a trigger as follows:
DELIMITER //
CREATE TRIGGER updtrigger BEFORE UPDATE ON webpage
FOR EACH ROW
BEGIN
-- <=> is MySQL's NULL-safe equality, so a change from NULL also counts
IF NOT (NEW.signature <=> OLD.signature) THEN
SET NEW.update_ts = NOW();
END IF;
END
//
DELIMITER ;
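
With that column in place, the lookup I originally wanted could be sketched
over JDBC roughly as follows. This is only an illustration under assumptions:
the class and method names are made up, the caller is assumed to supply an
open Connection to the Nutch MySQL database, and it relies on update_ts being
set both by the column default (new rows) and by the trigger (changed rows).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class ChangedPagesQuery {
    // Uses only the table and column names mentioned above.
    static final String SQL = "SELECT id FROM webpage WHERE update_ts >= ?";

    // Convert an ISO-8601 instant such as "2012-11-12T00:00:00Z" to a JDBC timestamp.
    static Timestamp cutoff(String isoInstant) {
        return Timestamp.from(Instant.parse(isoInstant));
    }

    // Hypothetical helper: return the ids of rows inserted or updated since the cutoff.
    static List<String> changedIds(Connection conn, Timestamp since) throws SQLException {
        List<String> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setTimestamp(1, since);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("id"));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // No database connection is opened here; this only shows the query and cutoff.
        System.out.println(SQL);
        System.out.println(cutoff("2012-11-12T00:00:00Z"));
    }
}
```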
I think the first is the lesser of two evils, and it seems to work,
but I don't know enough about Nutch to determine whether this is an abuse
of the semantics of the modifiedTime field. I'd love your $0.02.
Thanks,
jacob
On Tue, Nov 13, 2012 at 5:24 AM, Markus Jelsma
<[email protected]> wrote:
> In trunk the modified time is based on whether or not the signature has
> changed. It makes little sense relying on HTTP headers because almost no
> CMS implements it correctly and it messes (or allows to be messed with on
> purpose) with an adaptive schedule.
>
> https://issues.apache.org/jira/browse/NUTCH-1341
>
>
> -----Original message-----
> > From:[email protected] <[email protected]>
> > Sent: Tue 13-Nov-2012 11:13
> > To: [email protected]
> > Subject: RE: How to find ids of pages that have been newly crawled or
> > modified after a given date with Nutch 2.1
> >
> > I think the modifiedTime comes from the http headers if available, if
> > not it is left empty. In other words it is the time the content was last
> > modified according to the source if available and if not available it is
> > left blank. Depending on what Jacob is trying to achieve the one line
> > patch at https://issues.apache.org/jira/browse/NUTCH-1475 might be what
> > he needs (or might not be).
> >
> > James
> >
> > -----Original Message-----
> > From: Ferdy Galema [mailto:[email protected]]
> > Sent: Tuesday, November 13, 2012 6:31 PM
> > To: [email protected]
> > Subject: Re: How to find ids of pages that have been newly crawled or
> > modified after a given date with Nutch 2.1
> >
> > Hi,
> >
> > There might be something wrong with the field modifiedTime. I'm not sure
> > how well you can rely on this field (with the default or the adaptive
> > scheduler).
> >
> > If you want to get to the bottom of this, I suggest debugging or running
> > small crawls to test the behaviour. In case something doesn't work as
> > expected, please repost here or open a Jira.
> >
> > Ferdy.
> >
> > On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > If this question has already been answered please forgive me and point
> > > me to the appropriate thread.
> > >
> > > I'd like to be able to find the ids of all new pages crawled by nutch
> > > or pages modified since a fixed point in the past.
> > >
> > > I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
> > > appropriate back-end query should be something like:
> > >
> > > "select id from webpage where (prevFetchTime is null and fetchTime >= "X")
> > > or (modifiedTime >= "X")"
> > >
> > > where "X" is some point in the past.
> > >
> > > What I've found is that modifiedTime is always null. I am using the
> > > adaptive scheduler and the default md5 signature class. I've tried
> > > both re-injecting seed URLs as well as not, it seems to make no
> > > difference. modifiedTime remains null.
> > >
> > > I am most grateful for any help or advice. If my nutch-site.xml file
> > > would help I can forward it along.
> > >
> > > Thanks,
> > > jacob
> > >
> >
>