Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

Jacob Sisk Tue, 13 Nov 2012 11:32:14 -0800

Hi folks,

Thanks for all of your suggestions.  Here are two tentative fixes suggested
by my colleagues at work:


Fix 1:
Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer, change
line 129 to:

long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED) ?
System.currentTimeMillis() : page.getModifiedTime();
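
To make the intent of that one-liner easier to check outside of Nutch, here is a minimal stand-alone sketch of the same decision. It is not the Nutch source: the class name, the method name and the two constants are hypothetical stand-ins (the constants mirror FetchSchedule.STATUS_MODIFIED and a "not modified" status, with values I am assuming), and the current time is passed in so the behaviour can be tested.

```java
// Stand-alone sketch of the decision made in Fix 1; NOT the Nutch source.
final class ModifiedTimeSketch {
    // Hypothetical stand-ins for the FetchSchedule status constants (values assumed).
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    // If the page was detected as modified, stamp it with the current time;
    // otherwise keep whatever modified time the page already carried.
    static long chooseModifiedTime(int modified, long previousModifiedTime, long nowMillis) {
        return (modified == STATUS_MODIFIED) ? nowMillis : previousModifiedTime;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long previous = 0L; // a page that never had a modified time recorded
        System.out.println("modified:     " + chooseModifiedTime(STATUS_MODIFIED, previous, now));
        System.out.println("not modified: " + chooseModifiedTime(STATUS_NOTMODIFIED, previous, now));
    }
}
```

With this logic the field effectively means "the time the crawler last noticed a change" rather than "the time the source claims the content changed", which is the semantic question raised further down.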

Fix (or really workaround) 2:
Alter the webpage table in MySQL to contain a column

update_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP

Define a trigger as follows:

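DELIMITER //
CREATE TRIGGER updtrigger BEFORE UPDATE ON webpage
    FOR EACH ROW
    BEGIN
        -- <=> is MySQL's NULL-safe equality, so a signature that goes
        -- from NULL to a value also counts as a change
        IF NOT (NEW.signature <=> OLD.signature) THEN
            SET NEW.update_ts = NOW();
        END IF;
    END
//
DELIMITER ;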
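
With that column in place, the lookup I was originally after could look roughly like the sketch below. It is only an illustration under assumptions: the class and method names are hypothetical, it uses plain JDBC rather than anything Nutch provides, and it assumes the id and update_ts columns described above. Note that update_ts is also set when a row is first inserted, so the result includes rows that were created but perhaps not yet fetched.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists ids of rows whose update_ts is at or after a cutoff.
final class UpdatedPagesSketch {
    // Assumes the webpage table has the id column and the update_ts column from Fix 2.
    static final String QUERY = "SELECT id FROM webpage WHERE update_ts >= ?";

    static List<String> findUpdatedSince(Connection conn, Timestamp since) throws SQLException {
        List<String> ids = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setTimestamp(1, since); // bind the cutoff instead of concatenating it
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString("id"));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // No database is contacted here; this only shows the statement that would run.
        System.out.println(QUERY);
    }
}
```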


I think the first is the lesser of two evils and it seems to work,
but I don't know enough about Nutch to determine if this is an abuse
of the semantics of the modifiedTime field.  I'd love your $0.02.

Thanks,
jacob






On Tue, Nov 13, 2012 at 5:24 AM, Markus Jelsma
<[email protected]> wrote:

> In trunk the modified time is based on whether or not the signature has
> changed. It makes little sense to rely on HTTP headers because almost no
> CMS implements them correctly, and wrong (or deliberately manipulated)
> headers interfere with an adaptive schedule.
>
> https://issues.apache.org/jira/browse/NUTCH-1341
>
>
> -----Original message-----
> > From:[email protected] <[email protected]>
> > Sent: Tue 13-Nov-2012 11:13
> > To: [email protected]
> > Subject: RE: How to find ids of pages that have been newly crawled or
> > modified after a given date with Nutch 2.1
> >
> > I think the modifiedTime comes from the HTTP headers if available, if
> > not it is left empty.  In other words it is the time the content was last
> > modified according to the source if available and if not available it is
> > left blank.  Depending on what Jacob is trying to achieve the one line
> > patch at https://issues.apache.org/jira/browse/NUTCH-1475 might be what
> > he needs (or might not be).
> >
> > James
> >
> > -----Original Message-----
> > From: Ferdy Galema [mailto:[email protected]]
> > Sent: Tuesday, November 13, 2012 6:31 PM
> > To: [email protected]
> > Subject: Re: How to find ids of pages that have been newly crawled or
> > modified after a given date with Nutch 2.1
> >
> > Hi,
> >
> > There might be something wrong with the field modifiedTime. I'm not sure
> > how well you can rely on this field (with the default or the adaptive
> > scheduler).
> >
> > If you want to get to the bottom of this, I suggest debugging or running
> > small crawls to test the behaviour. In case something doesn't work as
> > expected, please repost here or open a Jira.
> >
> > Ferdy.
> >
> > On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > If this question has already been answered please forgive me and point
> > > me to the appropriate thread.
> > >
> > > I'd like to be able to find the ids of all new pages crawled by nutch
> > > or pages modified since a fixed point in the past.
> > >
> > > I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
> > > appropriate back-end query should be something like:
> > >
> > > >  select id from webpage where (prevFetchTime is null and fetchTime >= "X")
> > > >  or (modifiedTime >= "X")
> > >
> > > where "X" is some point in the past.
> > >
> > > > What I've found is that modifiedTime is always null.  I am using the
> > > > adaptive scheduler and the default md5 signature class.  I've tried
> > > > both re-injecting seed URLs and not re-injecting them; it seems to
> > > > make no difference.  modifiedTime remains null.
> > >
> > > > I am most grateful for any help or advice.  If my nutch-site.xml file
> > > > would help I can forward it along.
> > >
> > > Thanks,
> > > jacob
> > >
> >
>
