Hi,

There might be something wrong with the field modifiedTime. I'm not sure
how well you can rely on this field (with the default or the adaptive
scheduler).

If you want to get to the bottom of this, I suggest debugging or running
small crawls to test the behaviour. In case something doesn't work as
expected, please repost here or open a Jira.
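
One thing that may be worth checking in such a test is the query itself: in SQL a comparison written as "= null" never matches, so the first branch of the query quoted below would return nothing even if the data were fine. Here is a minimal sketch of such a check. It is an illustration only: it uses Python's built-in sqlite3 as a stand-in for MySQL, a cut-down webpage table with just the columns named in the query, and made-up integer timestamps. The function names are hypothetical.

```python
import sqlite3

# Stand-in for the Nutch webpage table. SQLite replaces MySQL here so the
# sketch is self-contained; only the columns named in the query are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webpage (id TEXT PRIMARY KEY, prevFetchTime INTEGER, "
    "fetchTime INTEGER, modifiedTime INTEGER)"
)
conn.executemany(
    "INSERT INTO webpage VALUES (?, ?, ?, ?)",
    [
        ("new-page", None, 200, None),   # first fetch, after the cutoff
        ("old-page", 50, 90, None),      # fetched before the cutoff, unmodified
        ("changed-page", 50, 200, 150),  # modified after the cutoff
    ],
)


def changed_since(cutoff):
    """Ids of pages first fetched or modified at or after the cutoff."""
    rows = conn.execute(
        "SELECT id FROM webpage "
        "WHERE (prevFetchTime IS NULL AND fetchTime >= ?) "
        "OR modifiedTime >= ? ORDER BY id",
        (cutoff, cutoff),
    )
    return [row[0] for row in rows]


def changed_since_with_equals_null(cutoff):
    """Same query written with '= NULL', which evaluates to NULL, not true."""
    rows = conn.execute(
        "SELECT id FROM webpage "
        "WHERE (prevFetchTime = NULL AND fetchTime >= ?) "
        "OR modifiedTime >= ? ORDER BY id",
        (cutoff, cutoff),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(changed_since(100))
    print(changed_since_with_equals_null(100))
```

With IS NULL the newly fetched page is found. With "= NULL" only pages that have a non-null modifiedTime can match, so if modifiedTime really is always null the query would come back empty.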

Ferdy.

On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk <[email protected]> wrote:

> Hi,
>
> If this question has already been answered please forgive me and point me
> to the appropriate thread.
>
> I'd like to be able to find the ids of all new pages crawled by nutch or
> pages modified since a fixed point in the past.
>
> I'm using Nutch 2.1 with MySQL as the back-end and it seems like the
> appropriate back-end query should be something like:
>
>  select id from webpage where (prevFetchTime is null and fetchTime >= "X")
>  or (modifiedTime >= "X")
>
> where "X" is some point in the past.
>
> What I've found is that modifiedTime is always null.  I am using the
> adaptive scheduler and the default md5 signature class.  I've tried both
> re-injecting the seed URLs and not re-injecting them; it makes no
> difference, and modifiedTime remains null.
>
> I am most grateful for any help or advice.  If my nutch-site.xml file would
> help I can forward it along.
>
> Thanks,
> jacob
>
