Hi again,

Maybe you could try getting differential logs from the chat server, if
possible. If you administer the chat server, you could set log rotation to,
for instance, every 10 minutes, and then add the rotated files as if they
were separate web pages.

Alternatively, check the db.fetch.interval.* values. The key to solving this
is probably to write custom classes and configure them as
db.fetch.schedule.class and db.signature.class. After all, you need to decide
how each page should be scheduled and which pages have actually been modified.
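
On your first question: the values in the log line look like milliseconds
since the Unix epoch rather than seconds, which would explain the odd year you
got from PHP's date(). Here is a small Python sketch to check that reading. It
uses only the two numbers from your log line; the helper name from_millis is
just made up for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_millis(ms: int) -> datetime:
    """Interpret ms as milliseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(milliseconds=ms)

# Values copied from the "-shouldFetch rejected" log line.
fetch_time_ms = 1314950217363
cur_time_ms = 1312358255779

fetch_time = from_millis(fetch_time_ms)
cur_time = from_millis(cur_time_ms)

# How far in the future the next fetch is scheduled.
gap = fetch_time - cur_time

print("curTime:  ", cur_time.isoformat())
print("fetchTime:", fetch_time.isoformat())
print("gap:      ", gap)
```

Read this way, curTime falls on 2011-08-03 and fetchTime on 2011-09-02, just
under 30 days later. As far as I know, 30 days (2592000 seconds) is the default
for db.fetch.interval.default, so my guess is that the URL is rejected because
it was fetched moments earlier and is not due again yet, not because the page
itself is too new.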

Best

2011/8/3 Christian Weiske <[email protected]>

> Hi,
>
>
> I'd like to crawl pages of chat logs that change whenever someone sends
> a message in our chat rooms, which happens every couple of seconds.
> The HTML log pages are updated instantly by the prosody jabber server
> and thus have always current timestamps.
>
> Nutch seems to reject them now because they are too new:
>
> > -shouldFetch rejected
> >  'http://conference.nr:5290/muc_log/',
> >  fetchTime=1314950217363, curTime=1312358255779
>
>
> I have two questions:
>
> 1. Which timestamp format is that? They don't seem to be unix
> timestamps, because
> > $ php -r 'echo date("Y-m-d H:i:s", 1312358255779);'
> > 43556-12-23 16:56:19
> is the wrong year :)
>
> 2. What can I do to not get those URLs rejected? I already tried to set
>   > db.fetch.schedule.adaptive.sync_delta
>   to false and
>   > db.fetch.schedule.adaptive.inc_rate
>   > db.fetch.schedule.adaptive.dec_rate
>   to 0, but that does not help.
>
> --
> Viele Grüße
> Christian Weiske
>
