Hi again,

Maybe you could crawl differential logs from the chat server, if that is possible. If you administer the chat server yourself, you could rotate its logs every 10 minutes, for instance, and then add each rotated file as if it were a separate web page.
Alternatively, check the db.fetch.interval.* values. The key to solving this is probably to write custom classes and configure them as db.fetch.schedule.class and db.signature.class. After all, you need to decide how each page should be scheduled and to detect which pages have actually been modified.

Best

2011/8/3 Christian Weiske <[email protected]>

> Hi,
>
>
> I'd like to crawl pages of chat logs that change whenever someone sends
> a message in our chat rooms, which happens every couple of seconds.
> The HTML log pages are updated instantly by the prosody jabber server
> and thus have always current timestamps.
>
> Nutch seems to reject them now because they are too new:
>
> > -shouldFetch rejected
> > 'http://conference.nr:5290/muc_log/',
> > fetchTime=1314950217363, curTime=1312358255779
>
>
> I have two questions:
>
> 1. Which timestamp format is that? They don't seem to be unix
> timestamps, because
> > $ php -r 'echo date("Y-m-d H:i:s", 1312358255779);'
> > 43556-12-23 16:56:19
> is the wrong year :)
>
> 2. What can I do to not get those URLs rejected? I already tried to set
> > db.fetch.schedule.adaptive.sync_delta
> to false and
> > db.fetch.schedule.adaptive.inc_rate
> > db.fetch.schedule.adaptive.dec_rate
> to 0, but that does not help.
>
> --
> Viele Grüße
> Christian Weiske
>
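
On question 1, as far as I can tell: the two values in the quoted log line look like milliseconds since the Unix epoch (the unit Java's System.currentTimeMillis() returns), not seconds, which would explain why PHP's date() lands in the year 43556. The sketch below is plain Java with no Nutch dependency; the class and method names (NutchTimestamps, fromMillis, daysBetween) are made up for illustration. It decodes both values and shows that fetchTime lies roughly 30 days after curTime, which would be consistent with the 30-day default of db.fetch.interval.default rather than with the page being "too new".

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decodes the timestamps from the log line.
final class NutchTimestamps {

    // Assumption: the logged value is milliseconds since 1970-01-01T00:00:00Z.
    static Instant fromMillis(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    // Distance between two millisecond timestamps, expressed in days.
    static double daysBetween(long fromMillis, long toMillis) {
        Duration d = Duration.between(fromMillis(fromMillis), fromMillis(toMillis));
        return d.toMillis() / 86_400_000.0;
    }

    public static void main(String[] args) {
        // Values copied from the quoted "-shouldFetch rejected" log line.
        long fetchTime = 1314950217363L;
        long curTime = 1312358255779L;

        System.out.println("curTime   = " + fromMillis(curTime));   // 2011-08-03T07:57:35.779Z
        System.out.println("fetchTime = " + fromMillis(fetchTime)); // 2011-09-02T07:56:57.363Z
        System.out.println(String.format(Locale.ROOT,
                "next fetch due in %.2f days", daysBetween(curTime, fetchTime)));
    }
}
```

In PHP the same check would mean dividing the value by 1000 before passing it to date(). If the 30-day gap is indeed the configured fetch interval, that is the setting (or the schedule class) to change for these pages.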

