RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

j.sullivan Wed, 14 Nov 2012 19:32:41 -0800

Markus, I was mistakenly thinking of a doc field with a similar name. Thanks 
for pointing that out.


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Tuesday, November 13, 2012 7:24 PM
To: [email protected]
Subject: RE: How to find ids of pages that have been newly crawled or modified 
after a given date with Nutch 2.1

In trunk the modified time is based on whether or not the signature has 
changed. It makes little sense to rely on HTTP headers because almost no CMS 
implements them correctly, and relying on them interferes with an adaptive 
schedule (or allows it to be manipulated on purpose).

https://issues.apache.org/jira/browse/NUTCH-1341
 
 
-----Original message-----
> From:[email protected] <[email protected]>
> Sent: Tue 13-Nov-2012 11:13
> To: [email protected]
> Subject: RE: How to find ids of pages that have been newly crawled or 
> modified after a given date with Nutch 2.1
> 
> I think the modifiedTime comes from the http headers if available, if not it 
> is left empty.  In other words it is the time the content was last modified 
> according to the source if available and if not available it is left blank.  
> Depending on what Jacob is trying to achieve the one line patch at 
> https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or 
> might not be).
> 
> James
> 
> -----Original Message-----
> From: Ferdy Galema [mailto:[email protected]]
> Sent: Tuesday, November 13, 2012 6:31 PM
> To: [email protected]
> Subject: Re: How to find ids of pages that have been newly crawled or 
> modified after a given date with Nutch 2.1
> 
> Hi,
> 
> There might be something wrong with the field modifiedTime. I'm not sure how 
> well you can rely on this field (with the default or the adaptive scheduler).
> 
> If you want to get to the bottom of this, I suggest debugging or running 
> small crawls to test the behaviour. In case something doesn't work as 
> expected, please repost here or open a Jira.
> 
> Ferdy.
> 
> On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk <[email protected]> wrote:
> 
> > Hi,
> >
> > If this question has already been answered please forgive me and 
> > point me to the appropriate thread.
> >
> > I'd like to be able to find the ids of all new pages crawled by 
> > nutch or pages modified since a fixed point in the past.
> >
> > I'm using Nutch 2.1 with MySQL as the back-end and it seems like the 
> > appropriate back-end query should be something like:
> >
> >  "select id from webpage where (prevFetchTime=null & fetchTime>="X") 
> > or (modifiedTime >= "X" )
> >
> > where "X" is some point in the past.
> >
> > > What I've found is that modifiedTime is always null.  I am using the
> > > adaptive scheduler and the default md5 signature class.  I've tried both
> > > re-injecting the seed URLs and not re-injecting them; it makes no
> > > difference.  modifiedTime remains null.
> >
> > > I am most grateful for any help or advice.  If my nutch-site.xml 
> > > file would help I can forward it along.
> >
> > Thanks,
> > jacob
> >
>
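
For anyone trying the back-end query discussed in this thread, here is a minimal sketch of its NULL handling. It is not Nutch code: it uses Python's built-in sqlite3 module rather than MySQL, and the `webpage` table below is a hypothetical stand-in that contains only the columns named in the thread, with times assumed to be stored as integers. The point it illustrates is standard SQL behaviour: a NULL check has to be written `IS NULL`, because a comparison of the form `= NULL` is never true, so the "new page" branch of the query silently matches nothing.

```python
import sqlite3

# Hypothetical stand-in for the `webpage` table: only the columns named in
# the thread, with times stored as plain integers (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webpage ("
    "id TEXT PRIMARY KEY, prevFetchTime INTEGER, "
    "fetchTime INTEGER, modifiedTime INTEGER)"
)
rows = [
    ("new-page", None, 2000, None),       # no previous fetch recorded
    ("changed-page", 500, 2500, 1800),    # modifiedTime after the cutoff
    ("old-page", 400, 900, 300),          # everything before the cutoff
    ("refetched-page", 600, 2200, None),  # refetched, modifiedTime is NULL
]
conn.executemany("INSERT INTO webpage VALUES (?, ?, ?, ?)", rows)


def ids_new_or_modified_since(conn, since):
    """Ids with no previous fetch and fetchTime >= since, or modified since."""
    query = (
        "SELECT id FROM webpage "
        "WHERE (prevFetchTime IS NULL AND fetchTime >= ?) "
        "OR modifiedTime >= ? "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(query, (since, since))]


def ids_with_equals_null(conn, since):
    """Same query written with `= NULL`, to show what that form returns."""
    query = (
        "SELECT id FROM webpage "
        "WHERE (prevFetchTime = NULL AND fetchTime >= ?) "
        "OR modifiedTime >= ? "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(query, (since, since))]


print(ids_new_or_modified_since(conn, 1000))
print(ids_with_equals_null(conn, 1000))
```

Note that rows whose modifiedTime is NULL can only be picked up through the first branch, so if modifiedTime is never populated (as reported in this thread), the second branch contributes nothing.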
