Hi,

If this question has already been answered please forgive me and point me
to the appropriate thread.

I'd like to be able to find the ids of all new pages crawled by Nutch, or
of pages modified, since a fixed point in the past.

I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
appropriate back-end query should be something like:

 select id from webpage where (prevFetchTime is null and fetchTime >= "X")
 or (modifiedTime >= "X")

where "X" is some point in the past.
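
To make the intent concrete, here is a minimal sketch of that query. It runs
against an in-memory SQLite table rather than the real MySQL back-end; the
sample rows, the integer timestamps and the helper name new_or_modified_ids
are made up for illustration, and only the column names come from the query
above. Note that a comparison written as "= NULL" never matches in SQL, so
the sketch uses IS NULL.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL back-end (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webpage (id TEXT PRIMARY KEY, fetchTime INTEGER, "
    "prevFetchTime INTEGER, modifiedTime INTEGER)"
)

# Hypothetical rows; timestamps are plain integers, with X = 1000 below.
rows = [
    ("new-page", 2000, None, None),      # first fetch happened after X
    ("old-page", 500, None, None),       # first fetch happened before X
    ("changed-page", 2500, 800, 2400),   # modified after X
    ("unchanged-page", 2500, 800, 300),  # refetched, but not modified since X
]
conn.executemany("INSERT INTO webpage VALUES (?, ?, ?, ?)", rows)


def new_or_modified_ids(conn, since):
    """Return ids of pages first fetched, or modified, at or after `since`."""
    query = (
        "SELECT id FROM webpage "
        "WHERE (prevFetchTime IS NULL AND fetchTime >= ?) "
        "OR modifiedTime >= ? "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(query, (since, since))]


print(new_or_modified_ids(conn, 1000))
```

With modifiedTime always null, only the first branch of the WHERE clause can
ever match, which is why the null values matter here.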

What I've found is that modifiedTime is always null. I am using the
adaptive scheduler and the default MD5 signature class. I've tried crawling
both with and without re-injecting the seed URLs; it makes no difference,
and modifiedTime remains null.

I would be most grateful for any help or advice. If my nutch-site.xml file
would help, I can forward it along.

Thanks,
jacob
